Abstract

We build penalized least-squares estimators using the slope heuristic and resampling
penalties. We prove oracle inequalities for the selected estimator with leading constant
asymptotically equal to 1. We compare the practical performance of these methods in a
short simulation study.
1 Introduction

The aim of model selection is to construct data-driven criteria to select a model among a
given list. The history of statistical model selection goes back at least to Akaike [1], [2] and
Mallows [18]. They proposed to select, among a collection of parametric models, the one
that minimizes an empirical loss plus a penalty term proportional to the dimension
of the model. Birgé & Massart [8] and Barron, Birgé & Massart [6] generalized this
approach, making in particular the link between model selection and adaptive estimation.
They proved that previous methods, in particular cross-validation (see Rudemo [20]) and
hard thresholding (see Donoho et al. [12]), can be viewed as penalization methods. More
recently, Birgé & Massart [9], Arlot & Massart [5] and Arlot [4] (see also [3]) raised the
problem of optimal efficient model selection. Basically, the aim is to select an estimator
satisfying an oracle inequality with leading constant asymptotically equal to 1. They
obtained such procedures thanks to a sharp estimator of the ideal penalty $\mathrm{pen}_{id}$. We will
be interested in two natural ideas that are used in practice to evaluate $\mathrm{pen}_{id}$ and have
proved to be efficient in other frameworks. The first one is the slope heuristic. It was introduced
in Birgé & Massart [9] in Gaussian regression and developed in Arlot & Massart [5] in
an M-estimation framework. It allows one to optimize the choice of a leading constant in
the penalty term, provided that we know the shape of $\mathrm{pen}_{id}$. The other one is Efron's
resampling heuristic. The basic idea comes from Efron [14] and was used by Fromont [15]
in the classification framework. Then, Arlot [4] made the link with ideal penalties and
developed the general procedure. To our knowledge, these methods have only been
theoretically validated in regression frameworks. We propose here to prove their efficiency
in density estimation. Let us now explain our context more precisely.

1.1 Least-squares estimators 

In this paper, we define and study efficient penalized least-squares estimators in the
density estimation framework when the error is measured with the $L^2$-loss. We observe $n$
i.i.d. random variables $X_1, \dots, X_n$, defined on a probability space $(\Omega, \mathcal{A}, \mathbb{P})$, valued in a
measurable space $(\mathbb{X}, \mathcal{X})$, with common law $P$. We assume that a measure $\mu$ on $(\mathbb{X}, \mathcal{X})$ is
given and we denote by $L^2(\mu)$ the Hilbert space of square integrable real valued functions
defined on $\mathbb{X}$. $L^2(\mu)$ is endowed with its classical scalar product, defined for all $t, t'$ in
$L^2(\mu)$ by

\[ \langle t, t' \rangle = \int_{\mathbb{X}} t(x)\, t'(x)\, d\mu(x) \]

and the associated $L^2$-norm $\|.\|$, defined for all $t$ in $L^2(\mu)$ by $\|t\| = \sqrt{\langle t, t \rangle}$. The
parameter of interest is the density $s$ of $P$ with respect to $\mu$; we assume that it belongs
to $L^2(\mu)$. The risk of an estimator $\hat{s}$ of $s$ is measured with the $L^2$-loss, that is $\|s - \hat{s}\|^2$,
which is random when $\hat{s}$ is.

$s$ minimizes the integrated quadratic contrast $PQ(t)$, where $Q : L^2(\mu) \to L^1(P)$ is defined
for all $t$ in $L^2(\mu)$ by $Q(t) = \|t\|^2 - 2t$. Hence, density estimation is a problem of
M-estimation. These problems are classically solved in two steps. First, we choose a "model"
$S_m$ that should be close to the parameter $s$, which means that $\inf_{t \in S_m} \|s - t\|^2$ is "small".
Then, we minimize over $S_m$ the empirical version of the integrated contrast, that is, we
choose

\[ \hat{s}_m \in \arg\min_{t \in S_m} P_n Q(t). \tag{1} \]

This last minimization can be computationally intractable for general sets $S_m$, leading to
intractable procedures in practice. However, it can be easily solved when $S_m$ is a linear
subspace of $L^2(\mu)$ since, for every orthonormal basis $(\psi_\lambda)_{\lambda \in m}$,

\[ \hat{s}_m = \sum_{\lambda \in m} (P_n \psi_\lambda)\, \psi_\lambda. \tag{2} \]

Thus, we will always assume that a model is a linear subspace of $L^2(\mu)$. The risk of the
least-squares estimator $\hat{s}_m$ defined in (1) is then decomposed in two terms, called bias and
variance, thanks to Pythagoras' relation. Let $s_m$ be the orthogonal projection of $s$ onto
$S_m$; then

\[ \|s - \hat{s}_m\|^2 = \|s - s_m\|^2 + \|s_m - \hat{s}_m\|^2. \]
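The projection estimator (2) and the bias-variance decomposition can be checked numerically. The sketch below is only an illustration under assumptions that are not part of the paper: the model is the regular histogram with $d$ bins on $[0,1]$, $\mu$ is the Lebesgue measure, the density is $s(x) = 2x$, and the function names are hypothetical.

```python
import numpy as np

def histogram_heights(x, d):
    """Heights of the projection estimator on d regular bins of [0, 1].
    With psi_k = sqrt(d) * 1_{I_k}, formula (2) gives height_k = d * N_k / n."""
    counts = np.histogram(np.asarray(x, dtype=float), bins=d, range=(0.0, 1.0))[0]
    return d * counts / len(x)

def risk_decomposition(x, d):
    """Return (loss, bias, variance) for the illustrative density s(x) = 2x."""
    h = histogram_heights(x, d)
    a = np.arange(d) / d                      # left endpoints of the bins
    b = a + 1.0 / d                           # right endpoints of the bins
    # Exact integral of (2x - h_k)^2 over each bin: ||s - s_hat_m||^2.
    loss = np.sum(((2 * b - h) ** 3 - (2 * a - h) ** 3) / 6.0)
    proj = (2 * np.arange(1, d + 1) - 1) / d  # heights of s_m (bin averages of 2x)
    bias = 1.0 / (3.0 * d ** 2)               # ||s - s_m||^2 in closed form
    variance = np.sum((h - proj) ** 2) / d    # ||s_m - s_hat_m||^2
    return loss, bias, variance

rng = np.random.default_rng(0)
sample = np.sqrt(rng.uniform(size=500))       # inverse-CDF sampling from s(x) = 2x
loss, bias, variance = risk_decomposition(sample, d=8)
```

Up to floating-point error, `loss` equals `bias + variance` for every sample, since $s - s_m$ is orthogonal to $S_m$ while $s_m - \hat{s}_m$ belongs to $S_m$.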

The statistician should choose a space $S_m$ realizing a trade-off between those terms. $S_m$
must be sufficiently "large" to ensure a small bias $\|s - s_m\|^2$, but not too large, so that the
variance $\|s_m - \hat{s}_m\|^2$ does not explode. The best trade-off depends on unknown properties
of $s$, since the bias is unknown, and on the behavior of the empirical minimizer $\hat{s}_m$ in
the space $S_m$. Classically, $S_m$ is a parametric space and the dimension $d_m$ of $S_m$ as a
linear space is used to give upper bounds on $D_m = n\mathbb{E}(\|s_m - \hat{s}_m\|^2)$. This approach
is validated in regular models under the assumption that the support of $s$ is a known
compact, as mentioned in Section 3. However, this definition can fail dramatically because
there exist simple models (histograms with a small dimension $d_m$) where $D_m$ is very
large, and infinite dimensional models where $D_m$ is easily upper bounded. This issue is
extensively discussed in Birgé [7]. Birgé chooses to keep the dimension $d_m$ of $S_m$ as a
complexity measure and builds new estimators that achieve better risk bounds than the
empirical minimizer. His procedures are unfortunately intractable for the practical user
because he can only prove the existence of his estimators. Even his bounds on the risk
are only interesting theoretically because they involve constants which are not optimal.
We will not take this point of view here and our estimator will always be the empirical
minimizer, mainly because it can easily be computed, see (2). We will focus on the quantity
$D_m/n$ and introduce a general Assumption (namely Assumption [V]) that allows us to work
indifferently with $D_m/n$ or with the actual risk $\|s_m - \hat{s}_m\|^2$. We will also provide and
study an estimator of $D_m/n$ based on the resampling heuristic.

We insist here on the fact that, unlike classical methods, we will not use in this paper
strong extra assumptions on $s$, such as $\|s\|_\infty < \infty$ or a compact support.



1.2 Model selection 

Recall that the choice of an optimal model $S_m$ is impossible without strong assumptions
on $s$, for example precise information on its regularity. However, under less restrictive
hypotheses, we can build a countable collection of models $(S_m)_{m \in \mathcal{M}_n}$, growing with the
number of observations, such that the best estimator in the associated collection $(\hat{s}_m)_{m \in \mathcal{M}_n}$
realizes an optimal trade-off, see for example Birgé & Massart [8] and Barron, Birgé &
Massart [6]. The aim is then to build an estimator $\hat{m}$ such that our final estimator, $\tilde{s} = \hat{s}_{\hat{m}}$,
behaves almost as well as any model $m_o$ in the set of oracles

\[ \mathcal{M}_n^* = \Big\{ m_o \in \mathcal{M}_n,\ \|\hat{s}_{m_o} - s\|^2 = \inf_{m \in \mathcal{M}_n} \|\hat{s}_m - s\|^2 \Big\}. \]

This is the problem of model selection. More precisely, we want $\tilde{s}$ to satisfy an oracle
inequality defined in general as follows.

Definition: (Trajectorial oracle inequality) Let $(p_n)_{n \in \mathbb{N}}$ be a summable sequence and let
$(C_n)_{n \in \mathbb{N}}$ and $(R_{m,n})_{n \in \mathbb{N}}$ be sequences of positive real numbers. The estimator $\tilde{s} = \hat{s}_{\hat{m}}$
satisfies a trajectorial oracle inequality $TO(C_n, (R_{m,n})_{m \in \mathcal{M}_n}, p_n)$ if

\[ \forall n \in \mathbb{N}^*,\quad \mathbb{P}\Big( \|s - \tilde{s}\|^2 > C_n \inf_{m \in \mathcal{M}_n} \big\{ \|s - \hat{s}_m\|^2 + R_{m,n} \big\} \Big) \le p_n. \tag{3} \]

When $\tilde{s}$ satisfies an oracle inequality, $C_n$ is called the leading constant.

In this paper, we are interested in the problem of optimal model selection defined as
follows.

Definition: (Optimal model selection) We say that $\tilde{s}$ is optimal, or that the selection
procedure $(X_1, \dots, X_n) \mapsto \hat{m}$ is optimal, when $\tilde{s}$ satisfies a trajectorial oracle inequality
$TO(1 + r_n, (R_{m,n})_{m \in \mathcal{M}_n}, p_n)$ with $r_n \to 0$ and, for all $n$ in $\mathbb{N}^*$ and $m$ in $\mathcal{M}_n$, $R_{m,n} = 0$. In
order to simplify the notations, when $\tilde{s}$ is optimal we will say that $\tilde{s}$ satisfies an optimal
oracle inequality $OTO(r_n, p_n)$.

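In order to build $\hat{m}$, we remark that, for all $m$ in $\mathcal{M}_n$,

\[ \|s - \hat{s}_m\|^2 = \|\hat{s}_m\|^2 - 2P\hat{s}_m + \|s\|^2 = P_nQ(\hat{s}_m) + 2\nu_n(\hat{s}_m) + \|s\|^2, \tag{4} \]

where $\nu_n = P_n - P$ is the centered empirical process. An oracle minimizes $\|s - \hat{s}_m\|^2$
and thus $P_nQ(\hat{s}_m) + 2\nu_n(\hat{s}_m)$. As we want to imitate the oracle, we will design a map
$\mathrm{pen} : \mathcal{M}_n \to \mathbb{R}_+$ and choose

\[ \hat{m} \in \arg\min_{m \in \mathcal{M}_n} \big\{ P_nQ(\hat{s}_m) + \mathrm{pen}(m) \big\}, \qquad \tilde{s} = \hat{s}_{\hat{m}}. \tag{5} \]

It is clear that the ideal penalty is $\mathrm{pen}_{id}(m) = 2\nu_n(\hat{s}_m)$. For all $m$ in $\mathcal{M}_n$, for every
orthonormal basis $(\psi_\lambda)_{\lambda \in m}$, $\hat{s}_m = \sum_{\lambda \in m} (P_n\psi_\lambda)\psi_\lambda$ and $s_m = \sum_{\lambda \in m} (P\psi_\lambda)\psi_\lambda$, thus

\[ \nu_n(\hat{s}_m - s_m) = \sum_{\lambda \in m} (\nu_n(\psi_\lambda))^2 = \|\hat{s}_m - s_m\|^2. \]

Let us define, for all $m$ in $\mathcal{M}_n$,

\[ p(m) = \nu_n(\hat{s}_m - s_m) = \|\hat{s}_m - s_m\|^2. \]

From (4) and (5), for all $m$ in $\mathcal{M}_n$,

\begin{align*}
\|s - \tilde{s}\|^2 &= \|\tilde{s}\|^2 - 2P\tilde{s} + \|s\|^2 = \|\tilde{s}\|^2 - 2P_n\tilde{s} + 2\nu_n\tilde{s} + \|s\|^2 \\
&\le P_nQ(\hat{s}_m) + \mathrm{pen}(m) + (2\nu_n(\tilde{s}) - \mathrm{pen}(\hat{m})) + \|s\|^2 \\
&= \|s - \hat{s}_m\|^2 + (\mathrm{pen}(m) - 2\nu_n(\hat{s}_m)) + (2\nu_n(\tilde{s}) - \mathrm{pen}(\hat{m})).
\end{align*}

Hence, for all $m$ in $\mathcal{M}_n$,

\[ \|s - \tilde{s}\|^2 \le \|s - \hat{s}_m\|^2 + (\mathrm{pen}(m) - 2p(m)) + (2p(\hat{m}) - \mathrm{pen}(\hat{m})) + 2\nu_n(s_{\hat{m}} - s_m). \tag{6} \]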
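Let us define, for all $c_1, c_2 > 0$, the function

\[ f_{c_1,c_2} : \mathbb{R}_+ \to \mathbb{R}_+, \quad x \mapsto \begin{cases} \dfrac{1 + c_1 x}{1 - c_2 x} & \text{if } x < 1/c_2, \\ +\infty & \text{if } x \ge 1/c_2. \end{cases} \tag{7} \]

It comes from inequality (6) that $\tilde{s}$ satisfies an oracle inequality $OTO(f_{2,2}(\epsilon_n), p_n)$ as soon
as, with probability larger than $1 - p_n$,

\[ \forall m \in \mathcal{M}_n, \quad \frac{|2p(m) - \mathrm{pen}(m)|}{\|s - \hat{s}_m\|^2} \le \epsilon_n \quad \text{and} \tag{8} \]

\[ \forall (m, m') \in \mathcal{M}_n^2, \quad \frac{2\nu_n(s_m - s_{m'})}{\|s - \hat{s}_m\|^2 + \|s - \hat{s}_{m'}\|^2} \le \epsilon_n. \tag{9} \]

Inequality (9) does not depend on our choice of penalty; we will check that it can easily
be satisfied in classical collections of models. In order to obtain inequality (8), we use two
methods, defined in M-estimation, but studied only in some regression frameworks.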



1.2.1 The slope heuristic 

The first one is referred to as the "slope heuristic". The idea was introduced by Birgé
& Massart [9] in the Gaussian regression framework and developed into a general algorithm
by Arlot & Massart [5]. This heuristic states that there exist a sequence $(\Delta_m)_{m \in \mathcal{M}_n}$ and
a constant $K_{\min}$ satisfying the following properties:

1. when $\mathrm{pen}(m) < K_{\min}\Delta_m$, then $\Delta_{\hat{m}}$ is too large, typically $\Delta_{\hat{m}} \ge C \max_{m \in \mathcal{M}_n} \Delta_m$,

2. when $\mathrm{pen}(m) \simeq (K_{\min} + \delta)\Delta_m$ for some $\delta > 0$, then $\Delta_{\hat{m}}$ is much smaller,

3. when $\mathrm{pen}(m) \simeq 2K_{\min}\Delta_m$, the selected estimator is optimal.

Thanks to the third point, when $\Delta_m$ and $K_{\min}$ are known, this heuristic says that the
penalty $\mathrm{pen}(m) = 2K_{\min}\Delta_m$ selects an optimal estimator. When only $\Delta_m$ is known, the
first and the second points can be used to calibrate $K_{\min}$ in practice, as shown by the
following algorithm (see Arlot & Massart [5]):

Slope algorithm

For all $K > 0$, compute the selected model $\hat{m}(K)$ given by (5) with the penalty $\mathrm{pen}(m) =
K\Delta_m$ and the associated complexity $\Delta_{\hat{m}(K)}$.

Find the constant $K_{\min}$ such that $\Delta_{\hat{m}(K)}$ is large when $K < K_{\min}$, and "much smaller"
when $K > K_{\min}$.

Take the final $\hat{m} = \hat{m}(2K_{\min})$.
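The three steps of the slope algorithm can be sketched as follows. This is a minimal illustration, not the calibration used in the paper: locating $K_{\min}$ at the largest drop of the complexity path over a finite grid is one possible reading of "much smaller", and the regular histogram models, the grid and the function names are assumptions of the example.

```python
import numpy as np

def empirical_contrasts(x, dims):
    """P_n Q(s_hat_m) = -||s_hat_m||^2 for regular histograms with d bins on [0, 1]."""
    x = np.asarray(x, dtype=float)
    values = []
    for d in dims:
        counts = np.histogram(x, bins=d, range=(0.0, 1.0))[0]
        values.append(-d * np.sum((counts / x.size) ** 2))
    return np.array(values)

def slope_algorithm(contrasts, complexities, k_grid):
    """Return (k_min, index of the finally selected model)."""
    contrasts = np.asarray(contrasts, dtype=float)
    complexities = np.asarray(complexities, dtype=float)
    k_grid = np.asarray(k_grid, dtype=float)
    # Step 1: selected model and its complexity for every K on the grid.
    selected = np.array([np.argmin(contrasts + k * complexities) for k in k_grid])
    path = complexities[selected]
    # Step 2: take K_min as the first grid value after the largest drop
    # of the complexity path (a heuristic choice, see the lead-in).
    drops = path[:-1] - path[1:]
    k_min = float(k_grid[np.argmax(drops) + 1])
    # Step 3: final selection with the penalty 2 * K_min * Delta_m.
    final = int(np.argmin(contrasts + 2.0 * k_min * complexities))
    return k_min, final

# Usage on simulated data, with the complexities Delta_m = d_m / n.
rng = np.random.default_rng(0)
data = rng.uniform(size=300)
dims = np.arange(1, 41)
k_min_hat, chosen = slope_algorithm(
    empirical_contrasts(data, dims), dims / data.size, np.linspace(0.1, 4.0, 40)
)
```

When the complexity path shows several jumps of comparable size, the largest-drop rule becomes unstable and the choice of $K_{\min}$ requires more care.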

We will justify the slope heuristic in the density estimation framework for $\Delta_m = \mathbb{E}(\|s_m -
\hat{s}_m\|^2) = D_m/n$ and $K_{\min} = 1$. In general, $D_m$ is unknown and has to be estimated; we
propose a resampling estimator and prove that it can be used without extra assumptions
to obtain optimal results.

1.2.2 Resampling penalties 

Data-driven penalties have already been used in density estimation, in particular
cross-validation methods as in Stone [21], Rudemo [20] or Celisse [11]. We are interested here
in the resampling penalties introduced by Arlot [4]. Let $(W_1, \dots, W_n)$ be a resampling
scheme, i.e. a vector of random variables independent of $X, X_1, \dots, X_n$ and exchangeable,
that is, for all permutations $\tau$ of $(1, \dots, n)$,

\[ (W_1, \dots, W_n) \text{ has the same law as } (W_{\tau(1)}, \dots, W_{\tau(n)}). \]

Hereafter, we denote by $\bar{W}_n = \sum_{i=1}^n W_i/n$ and by $\mathbb{E}^W$ and $\mathcal{L}^W$ respectively the expectation
and the law conditionally on the data $X, X_1, \dots, X_n$. Let $P_n^W = \sum_{i=1}^n W_i\delta_{X_i}/n$ and $\nu_n^W = P_n^W -
\bar{W}_nP_n$ be the resampled empirical processes. Arlot's procedure is based on the resampling
heuristic of Efron (see Efron [13]), which states that the law of a functional $F(P, P_n)$
is close to its resampled counterpart, that is the conditional law $\mathcal{L}^W(C_W F(\bar{W}_nP_n, P_n^W))$.
$C_W$ is a renormalizing constant that depends only on the resampling scheme and on $F$.
Following this heuristic, Arlot defines as a penalty the resampling estimate of the ideal
penalty $2D_m/n$, that is

\[ \mathrm{pen}(m) = 2C_W \mathbb{E}^W\big(\nu_n^W(\hat{s}_m^W)\big), \tag{10} \]

where $\hat{s}_m^W$ minimizes $P_n^WQ(t)$ over $S_m$. We prove concentration inequalities for $\mathrm{pen}(m)$
and deduce that $\mathrm{pen}(m)$ provides an optimal procedure.

The paper is organized as follows. In Section 2, we state our main results: we prove the
efficiency of the slope algorithm and of the resampling penalties.

In Section 3, we compute the rates of convergence in the oracle inequalities using classical
collections of models. Section 4 is devoted to a short simulation study where we compare
different methods in practice. The proofs are postponed to Section 5. Section 6 is an
Appendix where we add some probabilistic material: we prove a concentration inequality
for $Z^2$, where $Z = \sup_{t \in B} \nu_n(t)$ and $B$ is symmetric. We deduce a simple concentration
inequality for U-statistics of order 2 that extends a previous result by Houdré &
Reynaud-Bouret [16].

2 Main results 

Hereafter, we will denote by $c$, $C$, $K$, $\kappa$, $L$, $\alpha$, with various subscripts, some constants that
may vary from line to line.

2.1 Concentration of the ideal penalty 

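Take an orthonormal basis $(\psi_\lambda)_{\lambda \in m}$ of $S_m$. Easy algebra leads to

\[ s_m = \sum_{\lambda \in m} (P\psi_\lambda)\psi_\lambda, \quad \hat{s}_m = \sum_{\lambda \in m} (P_n\psi_\lambda)\psi_\lambda, \quad \text{thus } \|s_m - \hat{s}_m\|^2 = \sum_{\lambda \in m} (\nu_n(\psi_\lambda))^2. \]

$\hat{s}_m$ is an unbiased estimator of $s_m$ and

\[ \mathrm{pen}_{id}(m) = 2\nu_n(\hat{s}_m) = 2\nu_n(\hat{s}_m - s_m) + 2\nu_n(s_m) = 2\|\hat{s}_m - s_m\|^2 + 2\nu_n(s_m). \]

For all $m, m'$ in $\mathcal{M}_n$, let

\[ p(m) = \|s_m - \hat{s}_m\|^2 = \sum_{\lambda \in m} (\nu_n(\psi_\lambda))^2, \quad \delta(m, m') = 2\nu_n(s_m - s_{m'}). \tag{11} \]

From (6), for all $m$ in $\mathcal{M}_n$,

\[ \|s - \tilde{s}\|^2 \le \|s - \hat{s}_m\|^2 + (\mathrm{pen}(m) - 2p(m)) + (2p(\hat{m}) - \mathrm{pen}(\hat{m})) + \delta(\hat{m}, m). \tag{12} \]

In this section, we are interested in the concentration of $p(m)$ around $\mathbb{E}(p(m)) = D_m/n$.
Let us first remark that, for all $m$ in $\mathcal{M}_n$, $p(m)$ is the supremum of the square of the centered
empirical process over the ellipsoid $B_m = \{t \in S_m, \|t\| \le 1\}$. From Cauchy-Schwarz inequality, for
all real numbers $(b_\lambda)_{\lambda \in m}$,

\[ \sum_{\lambda \in m} b_\lambda^2 = \Big( \sup_{\sum a_\lambda^2 \le 1} \sum_{\lambda \in m} a_\lambda b_\lambda \Big)^2. \tag{13} \]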

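We apply this inequality with $b_\lambda = \nu_n(\psi_\lambda)$. We obtain, since the system $(\psi_\lambda)_{\lambda \in m}$ is
orthonormal,

\[ \sum_{\lambda \in m} (\nu_n(\psi_\lambda))^2 = \sup_{\sum a_\lambda^2 \le 1} \Big( \sum_{\lambda \in m} a_\lambda \nu_n(\psi_\lambda) \Big)^2 = \sup_{\sum a_\lambda^2 \le 1} \Big( \nu_n\Big( \sum_{\lambda \in m} a_\lambda \psi_\lambda \Big) \Big)^2 = \sup_{t \in B_m} (\nu_n(t))^2. \]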

Hence, $p(m)$ can be controlled by Talagrand's concentration inequality (see Talagrand [22]).
This inequality involves $D_m = n\mathbb{E}(\|s_m - \hat{s}_m\|^2)$ and the constants

\[ e_m = \frac{1}{n} \sup_{t \in B_m} \|t\|_\infty^2 \quad \text{and} \quad v_m^2 = \sup_{t \in B_m} \mathrm{Var}(t(X)). \tag{14} \]

More precisely, the following proposition holds:

Proposition 2.1 Let $X, X_1, \dots, X_n$ be i.i.d. random variables with common density $s$ with
respect to a probability measure $\mu$. Assume that $s$ belongs to $L^2(\mu)$ and let $S_m$ be a
linear subspace in $L^2(\mu)$. Let $s_m$ and $\hat{s}_m$ be respectively the orthogonal projection and the
projection estimator of $s$ onto $S_m$. Let $p(m) = \|s_m - \hat{s}_m\|^2$, $D_m = n\mathbb{E}(p(m))$ and let $v_m$,
$e_m$ be the constants defined in (14). Then, for all $x > 0$,



.3/4 

n n 



f>(RlIl- p(m) > ^-^P'^'i^rn.^')'^' + 1-71a/A;A^ + ^MemxA ^ 2 ^^_,/2o .^g. 

In n j 

Comments: From (12), for all $m$ in $\mathcal{M}_n$,

\begin{align*}
\|s - \tilde{s}\|^2 \le{} & \|s - \hat{s}_m\|^2 + \Big( \mathrm{pen}(m) - 2\frac{D_m}{n} \Big) + 2\Big( \frac{D_m}{n} - p(m) \Big) \\
& + 2\Big( p(\hat{m}) - \frac{D_{\hat{m}}}{n} \Big) + \Big( 2\frac{D_{\hat{m}}}{n} - \mathrm{pen}(\hat{m}) \Big) + \delta(\hat{m}, m). \tag{17}
\end{align*}

It appears from (17) that we can obtain oracle inequalities with a penalty of order $2D_m/n$
if, uniformly over $m, m'$ in $\mathcal{M}_n$,

\[ p(m) - \frac{D_m}{n} \ll \|s - \hat{s}_m\|^2 \quad \text{and} \quad \delta(m', m) \ll \|s - \hat{s}_m\|^2 + \|s - \hat{s}_{m'}\|^2. \]

Proposition 2.1 proves that the first part holds with large probability for all $m$ in $\mathcal{M}_n$
such that $e_m \vee v_m^2 \ll n\mathbb{E}(\|s - \hat{s}_m\|^2)$. Actually, the other part also holds under the same
kind of assumption.

2.2 Main assumptions 

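For all $m, m'$ in $\mathcal{M}_n$, let $D_m = n\mathbb{E}(\|s_m - \hat{s}_m\|^2)$,

\[ \frac{R_m}{n} = \mathbb{E}(\|s - \hat{s}_m\|^2) = \|s - s_m\|^2 + \frac{D_m}{n}, \]

\[ v_{m,m'}^2 = \sup_{t \in S_m + S_{m'},\, \|t\| \le 1} \mathrm{Var}(t(X)), \quad e_{m,m'} = \frac{1}{n} \sup_{t \in S_m + S_{m'},\, \|t\| \le 1} \|t\|_\infty^2. \]

For all $k \in \mathbb{N}$, let $\mathcal{M}_n^k = \{m \in \mathcal{M}_n,\ R_m \in [k, k+1)\}$. For all $n$ in $\mathbb{N}$, for all $k > 0$, $k' > 0$
and $\gamma > 0$, let $[k]$ be the integer part of $k$ and let

\[ l_{n,\gamma}(k, k') = \ln(1 + \mathrm{Card}(\mathcal{M}_n^{[k]})) + \ln(1 + \mathrm{Card}(\mathcal{M}_n^{[k']})) + \ln((k+1)(k'+1)) + (\ln n)^\gamma. \tag{18} \]

Assumption [V]: There exist $\gamma > 1$ and a sequence $(\epsilon_n)_{n \in \mathbb{N}}$, with $\epsilon_n \to 0$, such that, for
all $n$ in $\mathbb{N}$,

\[ \sup_{(k,k') \in (\mathbb{N}^*)^2}\ \sup_{(m,m') \in \mathcal{M}_n^k \times \mathcal{M}_n^{k'}} \left\{ \left( \Big( \frac{v_{m,m'}^2}{R_m \vee R_{m'}} \Big)^2 \vee \frac{e_{m,m'}}{R_m \vee R_{m'}} \right) l_{n,\gamma}^2(k, k') \right\} \le \epsilon_n^4. \]

[BR] There exist two sequences $(h_n^*)_{n \in \mathbb{N}^*}$ and $(h_n^o)_{n \in \mathbb{N}^*}$ with $(h_n^o \vee h_n^*) \to 0$ as $n \to \infty$
such that, for all $n$ in $\mathbb{N}^*$, for all $m_o \in \arg\min_{m \in \mathcal{M}_n} R_m$ and all $m^* \in \arg\max_{m \in \mathcal{M}_n} D_m$,

\[ \frac{R_{m_o}}{D_{m^*}} \le h_n^o, \qquad \frac{n\|s - s_{m^*}\|^2}{D_{m^*}} \le h_n^*. \]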

Comments:

• Assumption [V] ensures that the fluctuations of the ideal penalty are uniformly
small compared to the risk of the estimator $\hat{s}_m$. Note that for all $k, k'$, $l_{n,\gamma}(k, k') \ge
(\ln n)^\gamma$; thus, Assumption [V] holds only in typical non parametric situations where
$R_n = \inf_{m \in \mathcal{M}_n} R_m \to \infty$ as $n \to \infty$.

• The slope heuristic states that the complexity $\Delta_{\hat{m}}$ of the selected estimator is too
large when the penalty term is too small. A minimal assumption for this heuristic
to hold with $\Delta_m = D_m$ would be that there exists a sequence $(\theta_n)_{n \in \mathbb{N}^*}$ with $\theta_n \to 0$
as $n \to \infty$ such that, for all $n$ in $\mathbb{N}^*$, for all $m_o \in \arg\min_{m \in \mathcal{M}_n} \mathbb{E}(\|s - \hat{s}_m\|^2)$ and
all $m^* \in \arg\max_{m \in \mathcal{M}_n} \mathbb{E}(\|s_m - \hat{s}_m\|^2)$,

\[ D_{m_o} \le \theta_n D_{m^*}. \]

Assumption [BR] is slightly stronger but will always hold in the examples (see
Section 3).
In order to have an idea of the rates $R_n$, $\epsilon_n$, $h_n^*$, $h_n^o$ and $\theta_n$, let us briefly consider the
following very simple example:

Example HR: We assume that $s$ is supported in $[0, 1]$ and that $(S_m)_{m \in \mathcal{M}_n}$ is the collection
of the regular histograms on $[0, 1]$, with $d_m = 1, \dots, n$ pieces. We will see in Section 3.2
that $D_m \simeq d_m$ asymptotically, hence $D_{m^*} \simeq n$. Moreover, we assume that $s$ is Hölderian
and not constant, so that there exist positive constants $c_l, c_u, \alpha_l, \alpha_u$ such that, for all $m$ in
$\mathcal{M}_n$, see for example Arlot [3],

\[ c_l d_m^{-2\alpha_l} \le \|s - s_m\|^2 \le c_u d_m^{-2\alpha_u}. \]

In Section 3.2, we prove that this assumption implies [V] with $\epsilon_n \le C\ln(n)\, n^{-1/(8\alpha_l+4)}$.
Moreover, there exists a constant $C > 0$ such that $R_{m_o} \le \inf_{m \in \mathcal{M}_n} (c_u n d_m^{-2\alpha_u} + d_m) \le
Cn^{1/(2\alpha_u+1)}$, thus $R_{m_o}/D_{m^*} \le Cn^{1/(2\alpha_u+1) - 1} = Cn^{-2\alpha_u/(2\alpha_u+1)}$. Since there exists $C > 0$
such that $n\|s - s_{m^*}\|^2/D_{m^*} \le Cd_{m^*}^{-2\alpha_u} = Cn^{-2\alpha_u}$, [BR] holds with $h_n^o = Cn^{-2\alpha_u/(2\alpha_u+1)}$
and $h_n^* = Cn^{-2\alpha_u}$.
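The order of magnitude of the oracle risk in Example HR can be checked numerically. The sketch below minimizes the proxy $c\,n\,d^{-2\alpha} + d$ over $d = 1, \dots, n$; the exponent convention $d^{-2\alpha}$, the constant $c$ and the function name are assumptions of this illustration. The continuous minimizer is $d = (2\alpha c n)^{1/(2\alpha+1)}$, so both the optimal dimension and the minimal value are of order $n^{1/(2\alpha+1)}$.

```python
import numpy as np

def oracle_dimension(n, alpha, c=1.0):
    """Minimize the risk proxy c * n * d**(-2 * alpha) + d over d = 1, ..., n.
    Returns the minimizing dimension and the corresponding value."""
    d = np.arange(1, n + 1, dtype=float)
    risk = c * n * d ** (-2.0 * alpha) + d
    best = int(np.argmin(risk))
    return int(d[best]), float(risk[best])

# With alpha = 1 and c = 1, the continuous minimizer is (2n)^(1/3).
d_star, r_star = oracle_dimension(4000, alpha=1.0)
ratio_to_largest_model = r_star / 4000   # plays the role of R_{m_o} / D_{m*}
```

The ratio computed on the last line decreases like $n^{-2\alpha/(2\alpha+1)}$, which is the rate obtained for $h_n^o$ in the example.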

Other examples can be found in Birgé & Massart [8], see also Section 3.



2.3 Results on the Slope Heuristic 

Let us now turn to the slope heuristic presented in Section 1.2.1.

Theorem 2.2 (Minimal penalty) Let $\mathcal{M}_n$ be a collection of models satisfying [V] and
[BR] and let $\epsilon_n^* = \epsilon_n \vee h_n^*$.

Assume that there exists $0 < \delta_n < 1$ such that $0 \le \mathrm{pen}(m) \le (1 - \delta_n)D_m/n$. Let $\hat{m}$, $\tilde{s}$ be
the random variables defined in (5) and let

\[ c_n = \frac{\delta_n - 28\epsilon_n^*}{1 + 16\epsilon_n}. \]

There exists a constant $C > 0$ such that,

P (Dm > CnDm*, \\s - sf > ^ inf \\s - Smf) > 1 " Cc-^^i'^^^'. (19) 

Comments: Assume that $\mathrm{pen}(m) \le (1 - \delta)D_m/n$; then, inequality (19) proves that an
oracle inequality cannot be obtained since $c_n/h_n^o \to \infty$. Moreover, $D_{\hat{m}} \ge cD_{m^*}$ is as large
as possible. This proves point 1 of the slope heuristic.

Theorem 2.3 Let $\mathcal{M}_n$ be a collection of models satisfying Assumption [V]. Assume that
there exist $\delta^+ \ge \delta_- > -1$ and $0 \le p' < 1$ such that, with probability at least $1 - p'$,

\[ 2\frac{D_m}{n} + \delta_-\frac{R_m}{n} \le \mathrm{pen}(m) \le 2\frac{D_m}{n} + \delta^+\frac{R_m}{n}. \]



Let m, s be the random variables defined in ^ and let 

There exists a constant C > such that, with probability larger than 1 — p' — Ce~^^^^'^^"' , 
D^<Cn{6.,6+)Rm^, \\s-~sf <Cn{6-,6+) inf \\s-Smf. (20) 

meMn 






Comments

• Assume that $\mathrm{pen}(m) = KD_m/n$ with $K > 1$; then inequality (20) ensures that
$D_{\hat{m}} \le C_n(K, K)R_{m_o}$. Hence, $D_{\hat{m}}$ jumps from $D_{m^*}$ (Theorem 2.2) to $R_{m_o}$ (see (20)) when
$\mathrm{pen}(m)$ is around $D_m/n$, which is much smaller thanks to Assumption [BR]. This
proves point 2 of the slope heuristic.

• Point 3 of this heuristic comes from inequality (20) applied with small $\delta_-$ and $\delta^+$.
The rate of convergence of the leading constant to 1 is then given by the supremum
between $\delta_-$, $\delta^+$ and $\epsilon_n$.

• The condition on the penalty has the same form as the one given in Arlot & Massart
[5]. It comes from the fact that we do not know $D_m/n$ in many cases; therefore, it
has to be estimated. We propose two alternatives to solve this issue. In Section 2.4,
we give a resampling estimator of $D_m$. It can be used for all collections of models
satisfying [V] and its error of approximation is upper bounded by $\epsilon_nR_m/n$. Thus
Theorem 2.3 holds with $(\delta_- \vee \delta^+) \le C\epsilon_n$. In Section 3.2, we will also see that, in
regular models, we can use $d_m$ instead of $D_m$ and the error is upper bounded by
$CR_m/R_{m_o}$; thus Theorem 2.3 holds with $(\delta_- \vee \delta^+) \le C/R_{m_o} \ll \epsilon_n$ and $p' = 0$. In
both cases, we deduce from Theorem 2.3 that the estimator $\tilde{s}$ given by the slope
algorithm achieves an optimal oracle inequality $OTO(\kappa\epsilon_n, Ce^{-\frac{1}{2}(\ln n)^\gamma})$. In Example
HR, for example, we obtain $\epsilon_n = Cn^{-1/(8\alpha_l+4)}\ln n$.

2.4 Resampling penalties 

Optimal model selection is possible in density estimation provided that we have a sharp
estimation of $D_m = n\mathbb{E}\big(\sup_{t \in B_m} (\nu_n(t))^2\big)$. We propose an estimator of this quantity
based on the resampling heuristic. The model selection algorithm that we deduce is the
same as the resampling penalization procedure introduced by Arlot [4]. Let $F$ be a fixed
functional. Efron's heuristic states that the law $\mathcal{L}(F(\nu_n))$ is close to the conditional law
$\mathcal{L}^W(C_WF(\nu_n^W))$, where $C_W$ is a normalizing constant depending only on the resampling
scheme and the functional $F$. Let $P_n^W = \sum_{i=1}^n W_i\delta_{X_i}/n$ and $\nu_n^W = P_n^W - \bar{W}_nP_n$. The
resampling estimator of $D_m$ is $D_m^W = nC_W^2\mathbb{E}^W\big(\sup_{t \in B_m} (\nu_n^W(t))^2\big)$ and the associated
resampling penalty is $\mathrm{pen}(m) = 2D_m^W/n$. Actually, the following result describes the
concentration of $D_m^W$ around its mean $D_m$ and around $np(m)$.
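As an illustration, the resampling estimator can be computed for a histogram model. The sketch below makes several assumptions that are not in the text: the weights are Efron bootstrap weights (multinomial, so that $\bar{W}_n = 1$ and $\mathrm{Var}(W_1 - \bar{W}_n) = 1 - 1/n$), the normalization is the inverse of this variance, and the function names are hypothetical. It also uses a closed form that follows from exchangeability alone, $D_m^W = \frac{n}{n-1}\sum_{\lambda \in m}\big(P_n\psi_\lambda^2 - (P_n\psi_\lambda)^2\big)$, to check the Monte Carlo approximation.

```python
import numpy as np

def hist_basis_values(x, edges):
    """Matrix psi[i, k] = 1{x_i in I_k} / sqrt(mu(I_k)) for the bins given by edges.
    Assumes every x_i lies in [edges[0], edges[-1]]."""
    x = np.asarray(x, dtype=float)
    edges = np.asarray(edges, dtype=float)
    widths = np.diff(edges)
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, widths.size - 1)
    psi = np.zeros((x.size, widths.size))
    psi[np.arange(x.size), idx] = 1.0 / np.sqrt(widths[idx])
    return psi

def resampling_D_monte_carlo(psi, n_draws=2000, seed=0):
    """Monte Carlo approximation of D_m^W with multinomial (Efron) weights."""
    n = psi.shape[0]
    rng = np.random.default_rng(seed)
    w = rng.multinomial(n, np.full(n, 1.0 / n), size=n_draws)  # shape (n_draws, n)
    v2 = 1.0 - 1.0 / n                       # Var(W_1 - mean(W)) for these weights
    centred = w - w.mean(axis=1, keepdims=True)
    nu = centred @ psi / n                   # nu_n^W(psi_k) for each draw
    sup_sq = (nu ** 2).sum(axis=1)           # sup over B_m of (nu_n^W(t))^2
    return n * sup_sq.mean() / v2

def resampling_D_closed_form(psi):
    """Closed form of D_m^W, valid for any exchangeable weights."""
    n = psi.shape[0]
    return n / (n - 1.0) * ((psi ** 2).mean(axis=0) - psi.mean(axis=0) ** 2).sum()

rng = np.random.default_rng(1)
x = rng.uniform(size=40)
psi = hist_basis_values(x, np.linspace(0.0, 1.0, 6))
penalty = 2.0 * resampling_D_closed_form(psi) / x.size   # pen(m) = 2 D_m^W / n
```

The closed form shows that, for this functional, the estimator does not depend on the particular exchangeable scheme once it is normalized by the variance of the centered weights.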

Proposition 2.4 Let $(W_1, \dots, W_n)$ be a resampling scheme, let $S_m$ be a linear space, $B_m =
\{t \in S_m, \|t\| \le 1\}$, $p(m) = \sup_{t \in B_m} (\nu_n(t))^2$, $D_m = n\mathbb{E}(p(m))$ and let $D_m^W$ be the
resampling estimator of $D_m$ based on $(W_1, \dots, W_n)$, that is $D_m^W = nC_W^2\mathbb{E}^W\big(\sup_{t \in B_m} (\nu_n^W(t))^2\big)$,
where $v_W^2 = \mathrm{Var}(W_1 - \bar{W}_n)$ and $C_W^2 = (v_W^2)^{-1}$.
Then, for all $m$ in $\mathcal{M}_n$, $\mathbb{E}(D_m^W) = D_m$. Moreover, let
$e_m$, $v_m$ be the quantities defined in
(14). For all $x > 0$, on an event of probability larger than $1 - 7.8e^{-x/20}$,




9DT{emxY/^ + 7.61y/vlDmX 



(21) 



n- 1 




b-SlDTiemX^/^ + 3^/vlDmX + 3vlx 
n — 1 



(22) 
For all $x > 0$,

In n — 1 I ~ 

(23) 

P ( ^-p(m) < ^^^'(^rnx')'^' + 7.61 y/^^A;:^ + e^(40.3x)^ \ ^ ^ ^^.^^ 
In n — 1 y 

Remark

The concentration of the resampling estimator involves the same quantities as the
concentration of $p(m)$; thus, it can be used to estimate the ideal penalty in the slope heuristic's
algorithm presented in the previous section without extra assumptions on the collection
$\mathcal{M}_n$. Proposition 2.4 and Theorem 2.3 prove that this resampling penalty leads to an
efficient model selection procedure. However, we do not need to use the slope heuristic in
our framework to obtain an optimal model selection procedure, as shown by the following
theorem.

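Theorem 2.5 Let $X_1, \dots, X_n$ be i.i.d. random variables with common density $s$. Let $\mathcal{M}_n$ be
a collection of models satisfying Assumption [V]. Let $W_1, \dots, W_n$ be a resampling scheme,
let $\bar{W}_n = \sum_{i=1}^n W_i/n$, $v_W^2 = \mathrm{Var}(W_1 - \bar{W}_n)$ and $C_W = 2(v_W^2)^{-1}$. Let $\tilde{s}$ be the penalized
least-squares estimator defined in (5) with

\[ \mathrm{pen}(m) = C_W\mathbb{E}^W\Big( \sup_{t \in B_m} (\nu_n^W(t))^2 \Big). \]

Then, there exists a constant $C > 0$ such that

\[ \mathbb{P}\Big( \|s - \tilde{s}\|^2 \le (1 + 100\epsilon_n) \inf_{m \in \mathcal{M}_n} \|s - \hat{s}_m\|^2 \Big) \ge 1 - Ce^{-\frac{1}{2}(\ln n)^\gamma}. \tag{25} \]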

Comments: The main advantage of this result is that the penalty term is always
totally computable. Unlike the penalties derived from the slope heuristic, it does not
depend on an arbitrary choice of a constant $K_{\min}$ made by the observer, which may be
hard to detect in practice (see the paper of Arlot & Massart [5] for an extensive discussion
on this important issue). However, $C_W$ is only optimal asymptotically. It is sometimes
useful to overpenalize a little in order to improve the non-asymptotic performance of our
procedures (see Massart [19]) and the slope heuristic can be used to do it in an optimal
way (see our short simulation study in Section 4).

2.5 A remark on the "regularization phenomenon"

The regularization of the bootstrap phenomenon (see Arlot [3, 4] and the references
therein) states that the resampling estimator $C_W\mathbb{E}^W(F(\nu_n^W))$ of a functional $F(\nu_n)$
concentrates around its mean better than $F(\nu_n)$. This phenomenon can be justified with our
previous results for our functional $F$. Recall that we have proven in Proposition 2.1 that,
for all $x > 0$, with probability larger than $1 - 3.8e^{-x/20}$,



/ \ L)m 

p[m) 

n 



n 

In Example HR, we have the following upper bounds 

Ji <^ rl <: ^"^ 2 ^ II I 

n 
Thus, there exists a constant $C$ such that, for all $x > 0$,

2^ 



¥\\np{m) - Drn\ > Cd„. i^f^+ (^) ) ) < S.Se-^'/^o. (26) 



On the other hand, it comes from Inequalities (21) and (22) that, for all $x > 0$, on an
event of probability larger than $1 - 7.8e^{-x/20}$,

T7 + — r 

15 n — I J 



+- 



n — 1 

Thus, there exists a constant C such that, for all x > 0, 



X /XN 2 



\D^-D^\>Cd^{,r- + [^) ]]<7.8e-/^°. 



The concentration of $D_m^W$ is then much better than the one of $np(m)$. This implies that
$D_m^W$ is an estimator of $D_m$ rather than an estimator of $np(m)$. Thus, the resampling
penalty can be used when $D_m/n$ is a good penalty, for example under [V]. When $D_m/n$
is known to underpenalize (see the examples in Barron, Birgé & Massart [6]), there is no
chance that $D_m^W/n$ can work.



3 Rates of convergence for classical examples 

The aim of this section is to show that [V] can be derived from a more classical hypothesis 
in two classical collections of models: the histograms and Fourier spaces. We derive the 
rates under this new hypothesis. 



3.1 Assumption on the risk of the oracle 

As mentioned in Section 2.2, Assumption [V] can only hold if there exists $\gamma > 1$ such that
$R_n(\ln n)^{-\gamma} \to \infty$ as $n \to \infty$, where $R_n = \inf_{m \in \mathcal{M}_n} R_m$. In our examples, we will make the
following Assumption that ensures that this condition is always satisfied.

[BR] (Bounds on the Risk) There exist constants $C_u > 0$, $\alpha_u > 0$, $\gamma > 1$, and a sequence
$(\theta_n)_{n \in \mathbb{N}}$ with $\theta_n \to \infty$ as $n \to \infty$ such that, for all $n$ in $\mathbb{N}^*$, for all $m$ in $\mathcal{M}_n$,

\[ \theta_n^2(\ln n)^{2\gamma} \le R_n \le R_m \le C_un^{\alpha_u}. \]

Comments: Assumption [BR] holds with $\theta_n = Cn^{\alpha}$ for the collection of regular
histograms of Example HR, provided that $s$ is a Hölderian, non constant and compactly
supported function (see for example Arlot [3]). It is also a classical result of minimax
theory that there exist functions in Sobolev spaces satisfying this kind of Assumption when
$\mathcal{M}_n$ is the collection of Fourier spaces that we will introduce below.

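We want to check that these collections satisfy Assumption [V], i.e. that there exists
$\gamma > 1$ such that

\[ \sup_{(k,k') \in (\mathbb{N}^*)^2}\ \sup_{(m,m') \in \mathcal{M}_n^k \times \mathcal{M}_n^{k'}} \left\{ \left( \Big( \frac{v_{m,m'}^2}{R_m \vee R_{m'}} \Big)^2 \vee \frac{e_{m,m'}}{R_m \vee R_{m'}} \right) l_{n,\gamma}^2(k, k') \right\} \le \epsilon_n^4. \]

For all $m \in \mathcal{M}_n$, $R_m \le C_un^{\alpha_u}$; thus for all $k > C_un^{\alpha_u}$, $\mathrm{Card}(\mathcal{M}_n^k) = 0$. In particular,
we can assume in the previous supremum that $k \le C_un^{\alpha_u}$ and $k' \le C_un^{\alpha_u}$. Hence, there
exists a constant $\kappa > 0$ such that $\ln[(1 + k)(1 + k')] \le \kappa\ln n$. We also add the following
assumption that ensures that there exists a constant $\kappa > 0$ such that, for all $k \in \mathbb{N}$,
$\ln(1 + \mathrm{Card}(\mathcal{M}_n^k)) \le \kappa\ln n$.

[PC] (Polynomial collection) There exist constants $c_{\mathcal{M}} \ge 0$, $\alpha_{\mathcal{M}} \ge 0$, such that, for all $n$
in $\mathbb{N}$,

\[ \mathrm{Card}(\mathcal{M}_n) \le c_{\mathcal{M}}n^{\alpha_{\mathcal{M}}}. \]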

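Under Assumptions [BR] and [PC], there exists a constant $\kappa > 0$ such that, for all $\gamma > 1$
and $n \ge 3$,

\begin{align*}
& \sup_{(k,k') \in (\mathbb{N}^*)^2}\ \sup_{(m,m') \in \mathcal{M}_n^k \times \mathcal{M}_n^{k'}} \left\{ \left( \Big( \frac{v_{m,m'}^2}{R_m \vee R_{m'}} \Big)^2 \vee \frac{e_{m,m'}}{R_m \vee R_{m'}} \right) l_{n,\gamma}^2(k, k') \right\} \\
& \qquad \le \kappa \sup_{(m,m') \in (\mathcal{M}_n)^2} \left\{ \left( \Big( \frac{v_{m,m'}^2}{R_m \vee R_{m'}} \Big)^2 \vee \frac{e_{m,m'}}{R_m \vee R_{m'}} \right) (\ln n)^{2\gamma} \right\}.
\end{align*}
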

3.2 The histogram case 

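Let $(\mathbb{X}, \mathcal{X})$ be a measurable space. Let $(\mathcal{P}_m)_{m \in \mathcal{M}_n}$ be a collection of measurable partitions
$\mathcal{P}_m = (I_\lambda)_{\lambda \in m}$ of subsets of $\mathbb{X}$ such that, for all $m \in \mathcal{M}_n$, for all $\lambda \in m$, $0 < \mu(I_\lambda) < \infty$.
For $m$ in $\mathcal{M}_n$, the set $S_m$ of histograms associated to $\mathcal{P}_m$ is the set of functions which
are constant on each $I_\lambda$, $\lambda \in m$. $S_m$ is a linear space. Setting, for all $\lambda \in m$, $\psi_\lambda =
(\mu(I_\lambda))^{-1/2}1_{I_\lambda}$, the functions $(\psi_\lambda)_{\lambda \in m}$ form an orthonormal basis of $S_m$.
Let us recall that, for all $m$ in $\mathcal{M}_n$,

\[ D_m = \sum_{\lambda \in m} \mathrm{Var}(\psi_\lambda(X)) = \sum_{\lambda \in m} \big( P(\psi_\lambda^2) - (P\psi_\lambda)^2 \big) = \sum_{\lambda \in m} \frac{\mathbb{P}(X \in I_\lambda)}{\mu(I_\lambda)} - \|s_m\|^2. \tag{27} \]

Moreover, from Cauchy-Schwarz inequality, for all $x$ in $\mathbb{X}$, for all $m, m'$ in $\mathcal{M}_n$, with
$B_{m,m'} = \{t \in S_m + S_{m'}, \|t\| \le 1\}$,

\[ \sup_{t \in B_{m,m'}} t^2(x) \le \sum_{\lambda \in m \cup m'} \psi_\lambda^2(x), \quad \text{thus } e_{m,m'} = \frac{1}{n} \sup_{\lambda \in m \cup m'} \frac{1}{\mu(I_\lambda)}. \tag{28} \]

Finally, it is easy to check that, for all $m, m'$ in $\mathcal{M}_n$,

\[ v_{m,m'}^2 = \sup_{\lambda \in m \cup m'} \mathrm{Var}(\psi_\lambda(X)) = \sup_{\lambda \in m \cup m'} \frac{\mathbb{P}(X \in I_\lambda)(1 - \mathbb{P}(X \in I_\lambda))}{\mu(I_\lambda)}. \tag{29} \]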
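The quantities appearing in (27)-(29) are easy to evaluate for a single histogram model ($m = m'$). The following sketch is only an illustration; the function name and its inputs (the bin probabilities $\mathbb{P}(X \in I_\lambda)$ and the bin measures $\mu(I_\lambda)$) are assumptions of the example.

```python
import numpy as np

def histogram_constants(probs, measures, n):
    """Return (D_m, e_m, v_m^2) for a histogram model, taking m = m'.
    probs[k] = P(X in I_k), measures[k] = mu(I_k), n = sample size."""
    probs = np.asarray(probs, dtype=float)
    measures = np.asarray(measures, dtype=float)
    variances = probs * (1.0 - probs) / measures   # Var(psi_k(X)) for each bin
    D = variances.sum()                            # sum of the variances
    e = (1.0 / measures).max() / n                 # (1/n) sup 1 / mu(I_k)
    v2 = variances.max()                           # sup Var(psi_k(X))
    return D, e, v2

# Uniform density on [0, 1] and d = 10 regular bins: D_m = d - 1.
D, e, v2 = histogram_constants(np.full(10, 0.1), np.full(10, 0.1), n=100)
```

For the uniform density and regular bins, the variance term $D_m = d_m - 1$ is close to the dimension, in line with the claim that $D_m \simeq d_m$ in regular models.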

We will consider two particular types of histograms. 
Example 1 [Reg]: $\mu$-regular histograms.

For all $m$ in $\mathcal{M}_n$, $\mathcal{P}_m$ is a partition of $\mathbb{X}$ and there exist a family $(d_m)_{m \in \mathcal{M}_n}$ bounded by
$n$ and two constants $c_{rh}$, $C_{rh}$ such that, for all $m$ in $\mathcal{M}_n$, for all $\lambda \in m$,

\[ \frac{c_{rh}}{d_m} \le \mu(I_\lambda) \le \frac{C_{rh}}{d_m}. \]

The typical example here is the collection described in Example HR.



Example 2 [Ada]: Adapted histograms.
There exist positive constants $c_r$, $c_{ah}$ such that, for all $m$ in $\mathcal{M}_n$, for all $\lambda \in m$,
$\mu(I_\lambda) \ge c_rn^{-1}$ and

P{X Gh) ^ 

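[Ada] is typically satisfied when $s$ is bounded on $\mathbb{X}$. Remark that the models satisfying
[Ada] have finite dimension $d_m \le Cn$ since

\[ 1 \ge \sum_{\lambda \in m} \mathbb{P}(X \in I_\lambda) \ge c_{ah} \sum_{\lambda \in m} \mu(I_\lambda) \ge c_{ah}c_rd_mn^{-1}. \]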



The example [Reg].

It comes from equations (27), (28), (29) and Assumption [Reg] that

\[ C_{rh}^{-1}d_m - \|s_m\|^2 \le D_m \le c_{rh}^{-1}d_m - \|s_m\|^2, \]

\[ e_{m,m'} \le c_{rh}^{-1}\frac{d_m \vee d_{m'}}{n}, \qquad v_{m,m'}^2 \le \sup_{t \in B_{m,m'}} \|t\|_\infty\|t\|\|s\| \le c_{rh}^{-1/2}\|s\|\sqrt{d_m \vee d_{m'}}. \]

Thus 

Rm^ Rm' ^ '"^ 'n{Rm\/ Rm') 

If Z?„VZ)„, < 02(1^^^)27^ 

.,2 



Rm^Rm' - V '■'^ - 0„(lnn)' 

If DmyDm' > el{\nnf\ 



,,2 



RmyRm'~^ Dm^Dm' -0„(lnn)7 

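There exists $\kappa > 0$ such that $\theta_n^2(\ln n)^{2\gamma} \le \kappa n$ since, for all $m$ in $\mathcal{M}_n$, $R_m \le n\|s - s_m\|^2 +
c_{rh}^{-1}d_m \le (\|s\|^2 + c_{rh}^{-1})n$. Hence Assumption [V] holds with $\gamma$ given in Assumption [BR]
and $\epsilon_n = C\theta_n^{-1/2}$.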
The example [Ada].

It comes from inequalities (28), (29) and Assumption [Ada] that, for all $m$ and $m'$ in $\mathcal{M}_n$,

\[ e_{m,m'} \le c_r^{-1} \quad \text{and} \quad v_{m,m'}^2 \le c_{ah}. \]

Thus, there exists a constant $\kappa > 0$ such that, for all $m$ and $m'$ in $\mathcal{M}_n$,

„.2 \2 



f I III,. Ill, t \ I I I L . I I L I 

sup < „ ' „ V „ ^, „ > < 



(m,m')G(A4„) 



V Rm' / i^m V i?™' ( " 02 (In n)^"/ ■ 



Therefore Assumption [V] also holds with $\gamma$ given in Assumption [BR] and $\epsilon_n = \kappa\theta_n^{-1/2}$.

3.3 Fourier spaces

In this section, we assume that $s$ is supported in $[0, 1]$. We introduce the classical Fourier
basis. Let $\psi_0 : [0, 1] \to \mathbb{R}$, $x \mapsto 1$ and, for all $k \in \mathbb{N}^*$, we define the functions

\[ \psi_{1,k} : [0, 1] \to \mathbb{R},\ x \mapsto \sqrt{2}\cos(2\pi kx), \qquad \psi_{2,k} : [0, 1] \to \mathbb{R},\ x \mapsto \sqrt{2}\sin(2\pi kx). \]

For all $j$ in $\mathbb{N}^*$, let

\[ m_j = \{0\} \cup \{(i, k),\ i = 1, 2,\ k = 1, \dots, j\} \quad \text{and} \quad \mathcal{M}_n = \{m_j,\ j = 1, \dots, n\}. \]

For all $m$ in $\mathcal{M}_n$, let $S_m$ be the space spanned by the family $(\psi_\lambda)_{\lambda \in m}$. $(\psi_\lambda)_{\lambda \in m}$ is an
orthonormal basis of $S_m$ and, for all $j$ in $1, \dots, n$, $d_{m_j} = 2j + 1$.
Let $j$ in $1, \dots, n$; for all $x$ in $[0, 1]$,

\[ \sum_{\lambda \in m_j} \psi_\lambda^2(x) = 1 + 2\sum_{k=1}^{j} \big( \cos^2(2\pi kx) + \sin^2(2\pi kx) \big) = 1 + 2j = d_{m_j}. \]
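The identity $\sum_{\lambda \in m_j} \psi_\lambda^2(x) = d_{m_j}$ and the orthonormality of the system can be verified numerically. The sketch below is only an illustration; the function name is hypothetical, and the orthonormality check relies on a midpoint rule on a regular grid, which integrates trigonometric polynomials of low degree exactly.

```python
import numpy as np

def fourier_basis(x, j):
    """Values of psi_0, psi_{1,k}, psi_{2,k} (k = 1, ..., j) at the points x.
    Returns an array of shape (2 * j + 1, len(x))."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    k = np.arange(1, j + 1)[:, None]
    cos_part = np.sqrt(2.0) * np.cos(2.0 * np.pi * k * x)
    sin_part = np.sqrt(2.0) * np.sin(2.0 * np.pi * k * x)
    return np.vstack([np.ones((1, x.size)), cos_part, sin_part])

j = 4
values = fourier_basis(np.linspace(0.0, 1.0, 7), j)
sum_of_squares = (values ** 2).sum(axis=0)        # constant, equal to 2 * j + 1

grid = (np.arange(64) + 0.5) / 64                 # midpoint rule on [0, 1]
basis_on_grid = fourier_basis(grid, j)
gram = basis_on_grid @ basis_on_grid.T / grid.size  # approximates <psi_a, psi_b>
```

Since the sum of squares is constant in $x$, integrating it against any density on $[0,1]$ gives $d_{m_j}$, whatever the density.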

Hence, for all $m$ in $\mathcal{M}_n$,

\[ D_m = P\Big( \sum_{\lambda \in m} \psi_\lambda^2 \Big) - \|s_m\|^2 = d_m - \|s_m\|^2. \tag{30} \]

It is also clear that, for all $m, m'$ in $\mathcal{M}_n$,

\[ e_{m,m'} = \frac{d_m \vee d_{m'}}{n}, \qquad v_{m,m'}^2 \le \|s\|\sqrt{d_m \vee d_{m'}}. \tag{31} \]

The collection of Fourier spaces of dimension $d_m \le n$ satisfies Assumption [PC], and the
quantities $D_m$, $e_{m,m'}$ and $v_{m,m'}^2$ satisfy the same inequalities as in the collection [Reg];
therefore, [V] also follows from [BR] in this collection. We have obtained the following
corollary of Theorem 2.5.



Corollary 3.1 Let $\mathcal{M}_n$ be either a collection of histograms satisfying Assumptions [PC]-[Reg]
or [PC]-[Ada] or the collection of Fourier spaces of dimension $d_m \le n$. Assume
that $s$ satisfies Assumption [BR] for some $\gamma > 1$ and $\theta_n \to \infty$. Then, there exist constants
$\kappa > 0$ and $C > 0$ such that the estimator $\tilde{s}$ selected by a resampling penalty satisfies

\[ \mathbb{P}\Big( \|s - \tilde{s}\|^2 \le (1 + \kappa\theta_n^{-1/2}) \inf_{m \in \mathcal{M}_n} \|s - \hat{s}_m\|^2 \Big) \ge 1 - Ce^{-\frac{1}{2}(\ln n)^\gamma}. \]

Comment: Assumption [BR] is hard to check in practice. We mentioned that it holds
in Example HR provided that $s$ is Hölderian, non constant and compactly supported (see
Arlot [3]). It is also classical to build functions satisfying [BR] with the Fourier spaces in
order to prove that the oracle reaches the minimax rate of convergence over some Sobolev
balls, see for example Birgé & Massart [8], Barron, Birgé & Massart [6] or Massart [19].
In these cases, there exist $c > 0$, $\alpha > 0$ such that $\theta_n \ge cn^\alpha$. In more general situations,
we can use the same trick as Arlot [1] and use our main theorem only for the models
with dimension $d_m \ge (\ln n)^{4+2\gamma}$; they satisfy [BR] with $\theta_n = (\ln n)^2$, at least when $n$ is
sufficiently large, because

\[ \|s\|^2 + R_m \ge \|s\|^2 + D_m \ge cd_m \ge c(\ln n)^4(\ln n)^{2\gamma}. \]

With our concentration inequalities, we can easily control the risk of the models with
dimension $d_m \le (\ln n)^{4+2\gamma}$ by $\kappa(\ln n)^{3+5\gamma/2}$ with probability larger than $1 - Ce^{-\frac{1}{2}(\ln n)^\gamma}$
and we can then deduce the following corollary.
Corollary 3.2 Let $\mathcal{M}_n$ be either a collection of histograms satisfying Assumptions [PC]-[Reg] or [PC]-[Ada] or the collection of Fourier spaces of dimension $d_m \le n$. There exist constants $\kappa > 0$, $\eta > 3 + 5\gamma/2$ and $C > 0$ such that the estimator $\tilde{s}$ selected by a resampling penalty satisfies

F(\\s-rsf<(l + K{lnnr')( inf ||, - s^f + 11!^^ ^ > i _ Ce-^P'^")^ 
\ \meMn n J J 

4 Simulation study 

We propose in this section to show the practical performances of the slope algorithm and 
the resampling penalties on two examples. We estimate the density 

s{x) = ^X~^/^llQ^i]{x) 

and we compare the three following methods. 

1. The first one is the slope heuristic applied with the linear dimension $d_m$ of the
models. We observe two main behaviors of $d_{\hat{m}(K)}$ with respect to $K$. Most of the
time, we only observe one jump, as in Figure 1, and we find $K_{\min}$ easily.



Figure 1: Classical behavior of $K \mapsto d_{\hat{m}(K)}$

We also observe more difficult situations, such as the one in Figure 2 below, where we
can see several jumps. In these cases, as prescribed in the regression framework by
Arlot & Massart [5], we choose the constant $K_{\min}$ realizing the maximal jump of
$d_{\hat{m}(K)}$. Arlot & Massart [5] also proposed to select $K_{\min}$ as the minimal $K$ such
that $d_{\hat{m}(K)} \le d_{m^*}(\ln n)^{-1}$, but they obtained worse performances of the selected
estimator in their simulations.

We justify this method only for collections of models where $d_m \simeq K D_m$ for some
constant $K$. We will see that it gives really good performances when this condition
is satisfied.



2. The second method is the resampling-based penalization algorithm of Theorem 2.5.
Note here that all the resampling penalties $D_m^W/n$ can be easily computed, without
any Monte Carlo approximations. Actually, for all resampling schemes,

$$\frac{D_m^W}{n} = \frac{1}{n} \sum_{\lambda \in m} \left( P_n\psi_\lambda^2 - \frac{1}{n(n-1)} \sum_{i \ne j = 1}^{n} \psi_\lambda(X_i)\psi_\lambda(X_j) \right).$$
Resampling penalties always give good approximations of $D_m$. However, in non-asymptotic
situations, it may be useful to overpenalize slightly in order to improve
the leading constants in the oracle inequality (in Theorem 2.3, imagine that $46\epsilon_n$ is
very close to 1).

3. In a third method, we therefore propose to use the slope algorithm applied with a
complexity $D_m^W$. In this way, we hope to overpenalize the resampling
penalty slightly when it is necessary.
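The maximal-jump rule described in the first method can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used for the simulations: `emp_risk[m]` stands for the empirical contrast $P_nQ(\hat{s}_m)$, `complexity[m]` for either $d_m$ or $D_m^W$, and the final penalty is taken as twice the minimal one, which is the usual slope-heuristic convention and is assumed here. The function name `slope_select` and the grid of constants are hypothetical.

```python
import numpy as np

def slope_select(emp_risk, complexity, n, k_grid):
    """Slope heuristic with the maximal-jump rule (illustrative sketch).

    emp_risk[m]   : empirical contrast of model m
    complexity[m] : complexity of model m (d_m or D_m^W)
    k_grid        : increasing grid of candidate constants K
    Returns (k_min, index of the finally selected model).
    """
    emp_risk = np.asarray(emp_risk, dtype=float)
    complexity = np.asarray(complexity, dtype=float)
    k_grid = np.asarray(k_grid, dtype=float)
    # Model selected for each K, and the complexity path K -> complexity of m_hat(K).
    selected = np.array([np.argmin(emp_risk + k * complexity / n) for k in k_grid])
    path = complexity[selected]
    # Drop in complexity between consecutive constants; K_min sits at the largest one.
    jumps = path[:-1] - path[1:]
    k_min = k_grid[int(np.argmax(jumps)) + 1]
    # Assumed convention: final penalty equals twice the minimal penalty.
    m_hat = int(np.argmin(emp_risk + 2.0 * k_min * complexity / n))
    return float(k_min), m_hat
```

The grid resolution matters in practice: a grid that is too coarse can merge several jumps into one, so the grid should be fine around the region where the selected complexity collapses.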

4.1 Example 1: regular case 

In the first example, we consider the collection of regular histograms described in Example
HR and we observe $n = 100$ data points. In this example, we saw that $D_m^W \simeq D_m \simeq d_m$. We can
actually verify in Figure 2 that these quantities almost coincide for the selected model.

Figure 2: Comparison of $d_m$ and $D_m$ on the selected model (horizontal axis: constant $K$)

We compute $N = 1000$ times the oracle constant $c = \|s - \tilde{s}\|^2 / (\inf_{m \in \mathcal{M}_n} \|s - \hat{s}_m\|^2)$ for
the 3 methods. We put in the following array the mean, the median and the 0.95-quantile,
$q_{0.95}$, of these quantities.



method                 mean of the N constants c    median    q_{0.95}
slope + d_m            3.56                         2.30      10.07
resampling             4.43                         2.52      15.47
resampling + slope     3.57                         2.21      10.86
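For readers who want to reproduce such a summary, the statistics reported in the array can be computed as in the short sketch below. It is only an illustration: the helper names `oracle_constant` and `summarize` are hypothetical, and the quantile uses NumPy's default linear interpolation, which may differ from the convention used by the author.

```python
import numpy as np

def oracle_constant(selected_loss, all_losses):
    """Ratio of the loss of the selected estimator to the best loss in the collection."""
    return float(selected_loss) / float(np.min(all_losses))

def summarize(constants):
    """Mean, median and 0.95-quantile of the oracle constants over the replications."""
    c = np.asarray(constants, dtype=float)
    return {
        "mean": float(c.mean()),
        "median": float(np.median(c)),
        "q95": float(np.quantile(c, 0.95)),
    }
```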



We observe that the slope algorithm improves the resampling penalty in practice.
This may be due to a slight overpenalization, even if it is not a straightforward consequence
of our theoretical results. Note that, as $d_m \simeq D_m^W$, the slope algorithm leads to the same
results when applied with $d_m$ or with $D_m^W$. Although we have an explicit formula to
compute the resampling penalties, the computation time is much longer if we use $D_m^W$.
Therefore, we clearly recommend using the slope algorithm with $d_m$ for regular collections
of models, such as regular histograms or the Fourier spaces described in Section 3.3.
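The explicit formula mentioned above, $D_m^W = \sum_{\lambda \in m} \big( P_n\psi_\lambda^2 - \frac{1}{n(n-1)} \sum_{i \ne j} \psi_\lambda(X_i)\psi_\lambda(X_j) \big)$ (that is, $D^W = P_nT - U$ in Lemma 5.3 with $L = \mathrm{id}$), can be evaluated without a double loop through the identity $\sum_{i \ne j} a_i a_j = (\sum_i a_i)^2 - \sum_i a_i^2$. The sketch below is an illustration under that reading of the formula; the function names and the regular-histogram feature map are assumptions made for the example.

```python
import numpy as np

def resampling_complexity(psi):
    """Compute D^W from psi, an (n, d) array with psi[i, l] = psi_l(X_i)."""
    psi = np.asarray(psi, dtype=float)
    n = psi.shape[0]
    sum_sq = (psi ** 2).sum(axis=0)      # sum_i psi_l(X_i)^2
    sq_sum = psi.sum(axis=0) ** 2        # (sum_i psi_l(X_i))^2
    off_diag = sq_sum - sum_sq           # sum_{i != j} psi_l(X_i) psi_l(X_j)
    return float((sum_sq / n - off_diag / (n * (n - 1))).sum())

def regular_histogram_features(x, d):
    """psi_l = sqrt(d) * indicator of [l/d, (l+1)/d), l = 0..d-1, for x in [0, 1]."""
    x = np.asarray(x, dtype=float)
    bins = np.minimum((x * d).astype(int), d - 1)   # x = 1 goes to the last bin
    psi = np.zeros((x.size, d))
    psi[np.arange(x.size), bins] = np.sqrt(d)
    return psi
```

For a histogram with bin counts $N_\lambda$ this reduces to $\sum_\lambda d\,N_\lambda(n - N_\lambda)/(n(n-1))$. Each model costs $O(n\,d_m)$ operations, which is consistent with the remark that $D_m^W$ is more expensive to use than the linear dimension.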

4.2 Example 2: a more complicated collection 

In the next example, we want to show that the linear dimension should not be used in
general. Let us consider a slightly more complicated collection. Let $k, J_1, J_2, n$ be four
non-null integers satisfying $k < n$, $J_1 \le k$, $J_2 \le n - k$. We denote by $S_{k,J_1,J_2,n}$ the linear
space of histograms on the following partition.

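$$\left\{ \left[ \frac{k}{n}\frac{l}{J_1}, \frac{k}{n}\frac{l+1}{J_1} \right),\ l = 0, \dots, J_1 - 1 \right\} \cup \left\{ \left[ \frac{k}{n} + \left(1 - \frac{k}{n}\right)\frac{l}{J_2}, \frac{k}{n} + \left(1 - \frac{k}{n}\right)\frac{l+1}{J_2} \right),\ l = 0, \dots, J_2 - 1 \right\}.$$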



Let $n \in \mathbb{N}^*$ and let $\mathcal{M}_n = \{(k, J_1, J_2) \in (\mathbb{N}^*)^3;\ k < n,\ J_1 \le k,\ J_2 \le n - k\}$. It is clear
that $\mathrm{Card}(\mathcal{M}_n) \le n^3$. The oracle of this collection is better than the previous one since
the regular histograms belong to $(S_{m,n})_{m \in \mathcal{M}_n}$. It is easy to check that the dimension of
$S_{k,J_1,J_2,n}$ is equal to $J_1 + J_2$ and that $D_{k,J_1,J_2,n}$ is equal to $(nJ_1/k)F(k/n) + (nJ_2/(n - k))(1 - F(k/n)) - \|s_{k,J_1,J_2,n}\|^2$, where $F$ is the distribution function of the observations.
Hence, there is no constant $K_0$ such that $K_0 d_{k,J_1,J_2,n} = D_{k,J_1,J_2,n}$ as in the previous
example. Figure 3 illustrates this fact on the selected model.
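To make the lack of proportionality concrete, the sketch below builds the breakpoints of the partition and evaluates the first part of the complexity, $(nJ_1/k)F(k/n) + (nJ_2/(n-k))(1 - F(k/n))$, for a given distribution function. It assumes bins of width $k/(nJ_1)$ on $[0, k/n]$ and $(n-k)/(nJ_2)$ on $[k/n, 1]$, as implied by the formula in the text; the function names and the example distribution functions are illustrative choices, not taken from the simulation study.

```python
import numpy as np

def partition_breakpoints(k, j1, j2, n):
    """Breakpoints of the partition: j1 equal bins on [0, k/n], j2 equal bins on [k/n, 1]."""
    left = (k / n) * np.arange(j1 + 1) / j1
    right = k / n + (1 - k / n) * np.arange(1, j2 + 1) / j2
    return np.concatenate([left, right])

def expected_sum_of_squares(k, j1, j2, n, cdf):
    """P(sum of squared basis functions) = (n j1 / k) F(k/n) + (n j2 / (n - k)) (1 - F(k/n))."""
    f = cdf(k / n)
    return (n * j1 / k) * f + (n * j2 / (n - k)) * (1 - f)

# Uniform observations: the term equals the dimension j1 + j2.
uniform_term = expected_sum_of_squares(30, 6, 7, 100, lambda u: u)
# A distribution putting more mass near 0: the term is no longer the dimension.
skewed_term = expected_sum_of_squares(30, 6, 7, 100, lambda u: np.sqrt(u))
```

With a uniform distribution function the term coincides with $J_1 + J_2$, whereas a distribution that concentrates mass on the finer half of the partition inflates it, which is why the ratio between $D_m$ and $d_m$ depends on the model.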




Figure 3: Comparison of dm and on the selected model 

We also compute $N = 1000$ times the oracle constant $c = \|s - \tilde{s}\|^2 / (\inf_{m \in \mathcal{M}_n} \|s - \hat{s}_m\|^2)$
for the 3 methods, taking $n = 100$ observations each time. The results are summarized in
the following array.



method                 mean of the N constants c    median    q_{0.95}
slope + d_m            8.30                         7.01      19.73
resampling             6.11                         5.08      13.52
resampling + slope     5.33                         4.04      12.92



The slope heuristic gives bad results when applied with $d_m$. This is due to the fact that
$d_m$ is not proportional to $D_m$ here. The resampling-based penalty $2D_m^W/n$ is much better
and, as in the regular case, it is further improved by the slope algorithm. Therefore, for
general collections of models where we do not know an optimal shape of the ideal penalty,
we recommend applying the slope algorithm with a complexity equal to $D_m^W$.
5 Proofs

5.1 Proof of Proposition 2.1

It is a straightforward application of Corollary 6.6 in the appendix.

5.2 Technical lemmas 

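Before giving the proofs of the main theorems, we state and prove some important technical
lemmas that we will use repeatedly throughout the proofs. Let us recall here the main
notations. For all $m$, $m'$ in $\mathcal{M}_n$,

$$p(m) = \|s_m - \hat{s}_m\|^2, \qquad D_m = n\mathbb{E}(p(m)) = n\mathbb{E}\left( \|s_m - \hat{s}_m\|^2 \right),$$

$$R_m = n\mathbb{E}\left( \|s - \hat{s}_m\|^2 \right) = n\|s - s_m\|^2 + D_m, \qquad \delta(m, m') = 2\nu_n(s_m - s_{m'}).$$

For all $n \in \mathbb{N}^*$, $k > 0$, $k' > 0$, $\gamma > 0$, let $[k]$ be the integer part of $k$ and let

$$l_{n,\gamma}(k, k') = \ln\left( (1 + \mathrm{Card}(\mathcal{M}_n^{[k]}))(1 + \mathrm{Card}(\mathcal{M}_n^{[k']})) \right) + \ln((1 + k)(1 + k')) + (\ln n)^{\gamma}.$$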
Recall that Assumption [V] implies that, for all m,m' in A4n, 

em,m'iln,-yiRm, Rm'))^ < (^ti{Rm ^ Rm') ■ (32) 

Let us prove a simple result 
Lemma 5.1 For all K > I, 

^i^) = Yl E e-^^^(i+c-'^(-^n))+in(i+fe)] < (33) 

For all m in Ain, let Im = ln,'y{Rm, Rm), then, for all K > Xj^pl, 

e-^''- = 5](2J^2)e-^'{i'^«)\ (34) 

mGMn 

For all m, m' in Ai^i, 1st lm,m' = ^n,'yiRm, Rm'), then, for all K > 1, 

(m,m')e(A1„)2 

Proof : 

Inequality (I33p comes from the fact that, when K > 1, 



V/t G N, Yl e-^[i°(i+c-'i(^"))] < 1, and ^ e"^!'^^^ < oo. 

For all integer k such that ^ 0, for all m in M^, Im > 2[ln(l + Card(A^^)) + ln(l + 
k)] + (Inn)''', thus, for all K > 1/V2, it comes from ()33p that 

meMn kmmeMi^ 






Finally, for all integers {k, k') such that M.^ x / 0, 

Im^m' > ln(l + Card(A^^)) + ln(l + k) + ln(l + Card(7W^')) + ln(l + k') + (Inn)"^. 
Thus, from (f33]l . 



Lemma 5.2 Let $\mathcal{M}_n$ be a collection of models satisfying Assumption [V]. We consider
the following events.



ns = I V(m, m') G m') < 6e„-^™ ^ 



n 



and $\Omega_T = \Omega_\delta \cap \Omega_p$. Then there exists a constant $C > 0$ such that
Proof : 

Let $K > 1$ be a constant to be chosen later. We apply Lemma 6.8 in the appendix to
$u = s_m - s_{m'}$, $S = S_m + S_{m'}$, $L = \mathrm{id}$, $x = K^2 l_{n,\gamma}(R_m, R_{m'})$. For all $\eta > 0$, for all $m$, $m'$ in
$\mathcal{M}_n$, on an event of probability larger than $1 - e^{-K^2 l_{n,\gamma}(R_m, R_{m'})}$,

,,2 2vl^ ,K'^ln,j{Rrn,Rrn') + ern,rn'{K'^ln,'Y{Rm.,Rm'))'^/9 

d{m,m ) < -\\sm - Sm'W H • (36) 

From [V], for all m, m' in A^n, 

'^^m,m'^ l'n,y[^rn, -tim' )) -\ ^ S [2(Aenj + 



9 - y . 9 J n 

Moreover, for all m,m' in Mm 

\\Sm - Sm'f < 2(||s - SmW"^ + \\s - S^'lP) < 2{Rm + Rm') < ^Rm V Rm') 



Let en{K) = y^(Ke„,)2 + {Ken)'^/18. In §6^ we take r/ = e„(K) and we obtain 

5{m,m') > 4e^(K) ^"^^'"' j < e-^'"--(^'-^™'). (37) 

From (135|), for all K > 1, 

P (^V(?Ti,m') G Ml 6{m,m') > 4e„ (i^) ^"^ ^ ^ < (S(ir))2e-^(''^")'. 

Let K = 1.1 and take n sufficiently large so that K^e'^/18 < 1, then Aen{K) < 6e„. Hence, 
the first conclusion of Lemma 5.2 holds for sufficiently large $n$; it holds in general, provided
that we increase the constant $C$ if necessary.
We apply Assumption [V] (see (32)) with $m = m'$. Let $l_m = l_{n,\gamma}(R_m, R_m)$; for all $K > 0$,
for all $n$ such that $4.06(K\epsilon_n)^2 \le 2$,

< {l.7Ken + 0.15(/f6„)' + (Ken)')— < me^ — . 



n 

n n 

It comes then from Proposition 12. II applied with x = K'^lm that, for all m in 7W„ 

n 71/ 



Thus, from ()34p . for all X > vlO, and for all n sufficiently large, 

Vm G p{m) 3Ken^) < S(i^VlO)e-^(i"")\ 

n n J 

We use the same arguments to prove that 

n n / 



Fix $K = \sqrt{10.5}$; then, for all $n$ sufficiently large, the conclusion of Lemma 5.2 holds. It
holds in general provided that we increase the constant $C$ if necessary.

Lemma 5.3 Let (^a)agA an orthonormal system in Lp'{^) and let L he a linear func- 
tional defined on L?'{n). Let p{K) = Yli\&ki^n{L{ipx)))'^ . Let {Wi, ...,Wn) he a resampling 
scheme, let Wn = ZliLi ^i/"' '^"■^ '^w — ^aK^^i ~ ^n)- -^ei 



AeA 

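$T = \sum_{\lambda \in \Lambda} (L(\psi_\lambda) - PL(\psi_\lambda))^2$, $D = PT$ and

$$U = \frac{1}{n(n-1)} \sum_{i \ne j = 1}^{n} \sum_{\lambda \in \Lambda} (L(\psi_\lambda)(X_i) - PL(\psi_\lambda))(L(\psi_\lambda)(X_j) - PL(\psi_\lambda)),$$

then

$$p(\Lambda) = \frac{1}{n}P_nT + \frac{n-1}{n}U, \qquad D^W = P_nT - U, \qquad p(\Lambda) - \frac{D^W}{n} = U,$$

$$\mathbb{E}(D^W) = D, \qquad D^W - D = \nu_nT - U.$$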
Proof :

It is easy to check that 



^ n 1 

AgA i=l i=l 
1 



j^j=i aga 
n n 



Recah that i/^ = - For all A in A, since Er=i(^i " ^n) = 0, 

1 " 

z/r(L(VA)) = -^{W^-Wn)L{i:,){X,) 

i=l 

1 " 

= -y2{W^-Wn){L{i^x){X^)-PL{i;x))■ 



i=l 



Thus, if Ei^j = E ((VF, - Wn){Wj - Wn)) 1 1 



^{V^^r^ E ^^"^ ( f ^ E(^« - ^^n)(^(V'A)(X,) - PL(^a))) 

aga y V"" j=i / 



- E E^^^(^(^A)(^.)-^'L(^A))(i(V'A)(^,)-P^^A)). 

i5^j=i aga 

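Since the weights are exchangeable, for all $i = 1, \dots, n$, $\mathbb{E}((W_i - \bar{W}_n)^2) = \mathrm{Var}(W_1 - \bar{W}_n) = v_W^2$
and, for all $i \ne j = 1, \dots, n$,

$$v_W^2 E_{i,j} = \mathbb{E}\left( (W_i - \bar{W}_n)(W_j - \bar{W}_n) \right) = \mathbb{E}\left( (W_1 - \bar{W}_n)(W_2 - \bar{W}_n) \right).$$

Moreover, since $\sum_{i=1}^{n} (W_i - \bar{W}_n) = 0$,

$$0 = \mathbb{E}\left[ \Big( \sum_{i=1}^{n} (W_i - \bar{W}_n) \Big)^2 \right] = \sum_{i=1}^{n} \mathbb{E}((W_i - \bar{W}_n)^2) + \sum_{i \ne j} v_W^2 E_{i,j}$$

$$= n\mathbb{E}((W_1 - \bar{W}_n)^2) + n(n-1)\mathbb{E}\left( (W_1 - \bar{W}_n)(W_2 - \bar{W}_n) \right).$$

Hence, for all $i \ne j = 1, \dots, n$, $E_{i,j} = -1/(n-1)$, thus

$$D^W = P_nT - U.$$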

The last equalities of Lemma 5.3 follow from the fact that $\mathbb{E}(U) = 0$. Finally,

$$p(\Lambda) - \frac{D^W}{n} = \frac{1}{n}P_nT + \frac{n-1}{n}U - \left( \frac{1}{n}P_nT - \frac{1}{n}U \right) = U.$$
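The key step of this proof, $E_{i,j} = -1/(n-1)$ for $i \ne j$, only uses the fact that the centered weights sum to zero. The sketch below illustrates it in two ways: a deterministic identity valid for any weight vector, and an exact enumeration for Efron's bootstrap weights (multinomial counts), chosen here as an example of an exchangeable scheme. The function names are hypothetical and the enumeration is only feasible for very small $n$.

```python
import itertools
import numpy as np

def centered_cross_sums(w):
    """Return (sum_i c_i^2, sum_{i != j} c_i c_j) for the centered weights c = w - mean(w)."""
    c = np.asarray(w, dtype=float)
    c = c - c.mean()
    diag = float((c ** 2).sum())
    off_diag = float(c.sum() ** 2 - diag)   # equals -diag because c sums to zero
    return diag, off_diag

def efron_ratio(n):
    """Exact E[(W_1 - Wbar)(W_2 - Wbar)] / E[(W_1 - Wbar)^2] for Efron bootstrap weights."""
    num = den = 0.0
    # Enumerate all n^n equally likely draws with replacement from {0, ..., n-1}.
    for draw in itertools.product(range(n), repeat=n):
        w = np.bincount(draw, minlength=n).astype(float)
        c = w - w.mean()
        den += c[0] ** 2
        num += c[0] * c[1]
    return num / den
```

The ratio returned by `efron_ratio(n)` is $-1/(n-1)$, which matches the value of $E_{i,j}$ used to obtain $D^W = P_nT - U$.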
n n 
Lemma 5.4 Let



' ' n n 



m.£Mn 

inn n^lT 



and j7p = n There exists a constant C > such that P(fip) < Ce 2(''^")^. 
Proof : 

From Assumption [V] applied with m = m', (see (f32|) ). if = ln,--/{Rm, Rm), for all 
> 0, 

D^J^emiK^lmfy/^ < KenRm, VvlDmiKHm) < KeA, 
vliK^lm.) < {KenfRm, e^{Kl^f < {Ken)^Rm. 
We apply Proposition 12.41 with x = K'^lm and we obtain 



n 



p{m) > (8.31Ke„ + 3{Kenf + {19 .1)\K Cn)^) < 26" 



Thus, for all K > 1/(^2), if en{K) = n (8.31i^e„ + 3{K€nf + il9.lf{Ken)^) /(n - 1) 
from (l3l 



n n y 

Take K = 8/8.31 and n > 10 sufficiently large to ensure that 2,K'^en + {l9.lfK^el < 1, 
then 

en{K) < y (8e„ + e„) < 10e„. 
We deduce that, for sufficiently large n, 

We also apply Proposition 12.41 with x = K'^lm-, and we use the same arguments to prove 
that, for K = 16/16.61, for all n > 10 sufficiently large to ensure that {A{).2,f K'^e^ < 2 

Vm G Mn: ^-p{m) < -20en — ) < 3.8S(2K2)g-i^2(lnn)^_ 

n n J 

Hence, the conclusion of Lemma 5.4 holds for sufficiently large $n$. It holds in general,
provided that we increase the constant $C$ if necessary.

5.3 Proof of Theorem [Ml 

If On < 0, there is nothing to prove. We can then assume that Cn > 0, this implies in 
particular that 

28e„ < 6n < 1. 

We use the notations of Lemma [5.21 From Lemma [5. 2 1 the inequalities (|19|) will be proved 
if, on Qt, Dm > CnDm* and 

II ~l|2 \ '^ri . r II - ||2 
S — S >— — mf S — • 
5/l° m&Mn 






Let nio S argmirimeA^n ^m, ^ minimizes over the following criterion. 

Crit(?Ti) = PnQ{sm) + pen(m) + + 2f„(smJ 

= \\s — Sm\\^ — p{m) + 6{mo, m) + pen(m). 

Recall that < pen(m) < (1 — 5n)D„Jn. On ily, for all m in M-n-, since Rm > Rmo^ 

Critfm) > lis — SmlP — — 16e„ — — > —(1 + 16e„) — —. 

n n n 

Crit(m) < ||s - SmiP + 26en— - (5„— = (1 + 26ert)||s - Smll^ - ((^n - 26e„) — . 

n n n 



When Dm < CnDm* , 

/' 72 1 1 S — S * 1 1 ^ 

(1 + lQen)Dm < Dm* {5n " 26e„) - (1 + 26e„)- 



Dm* 

Thus Crit(m) > Crit(m*). This implies that Dm > CnDm*- 
Moreover, on ily, we also have, for all m in A4n 

n \ n n 



and 



Thus 



inf \\s-Smf< inf :^(l + 10e„) < :^(l + 10e, 
mGMn meMn n 71 



\\s-sf > (l-20e„)^>(l-20e„):^>(l- 206„)c„^ 

n n n 

l-20e„i?mo . c„l-20e„ . . n . 1,2 
> Cn ; > -, mi s — Sm . 

We conclude the proof by noting that $\epsilon_n < 1/28$ implies that $(1 - 20\epsilon_n)(1 + 10\epsilon_n)^{-1} > 8/38 > 1/5$.

5.4 Proof of Theorem 

If $\delta_- - 46\epsilon_n \le -1$, there is nothing to prove; hence, we can assume in the following that
$\delta_- - 46\epsilon_n > -1$.

We keep the notation $\Omega_T$ introduced in Lemma 5.2. Let

i^pen = S hd_ < pen(m) < h >, 

''In n n n \ 

m&Mn 

Q. = Qt n Open and nio G argminmeA4„ Rm- Recall that P(Opcn) > 1 — p' and that, m 
minimizes over m the following criterion. 

Crit(m) = PnQ{sm) + pen(m) + + 2f„(smJ 

= \\s — Sm\\^ — p{m) + 6{mo, m) + pen(m). 
Therefore, on $\Omega$, for all $m$ in $\mathcal{M}_n$, since $R_m \ge R_{m_o}$,

Crit(m) > {I + 6.)^ + ( - pim)) - 6en— 

n \ n J n 

> (1 + 5_ - 16e„)||s - Smf + (1 + (5- - 16e„)— > (1 + (5„ - 16e„,) — 

n n 

Crit(m) < (1 + J+ +26e„) — . 

n 

If D„>C„(5_,J+)i?„,„, 

(1 + 5_ - 16e„)I?™ > (1 + 5+ + 26e„)i2^,, 

Thus Crit(m) > Crit(mo), hence < Cn{5-,5^)Rmo- 
Moreover, from ([6]), for all m in A^n 

||s — s|P < lis — SmiP + (pen(m) — 2p(m)) + (2p(m) — pen(m)) + (5(m, m) 

\ n J n 

+2 (p(,n)-^')+(-5_+66„):^ 
\ n J n 

< \\s - s^f + (46e„ + S+)^ + {26en - <5-)^. 

n n 

For all m in A^n, on JIt^, 

\\s-smf = ^+ (p{m) -£j^)>{l- 20e^)^. 

n \ n J n 

Hence, for all m G 

I ~i|2^ii . 1,2 , ^Qen + S+ \ , 26e„-J_ „ 2 

S — S < S — Sm 1 H H s — s\\ . 

™" V l-20en J l-20en 

This concludes the proof of Proposition 2.3.

5.5 Proof of Proposition 2.4

We apply Lemma 5.3 with $L = \mathrm{id}$ and $\Lambda = m$. By definition of $p(m)$ and $D_m^W$,

Thus, from Lemma 16.71 in the appendix, for all x > 0, 

^( , . 5.31Z)^/^(e„x2)i/4 + 2,^vl,Drr,x + 2,vl,x + e^(19.l2;)2 \ ^ _ 

r p(mj > < 2e 

\ n n — 1 / 

In n — 1 / ~ 
This proves ([23]) and ([2^
In order to obtain (I2ip and (I22p . we introduce, for all m in Aln) the function T„ 
SAGm(V'A — Pi'xY ^'^^ ttL6 random variable 

1 " 

We apply Lemma 15.31 with L = id, we obtain 

From Bernstein's inequality (see Proposition 16. 3p . for all x > and all in { — 1, 1}, 



From Cauchy-Schwarz inequality, Tm. = supjg5^^^(t — Pt)^, thus ||Tm||^/n = 4em and 
Var(Tm(X))/n < HTmlloo PT^/n = 4:emDm, therefore, for all x > and all ^ in { — 1, 1}, 



■X 



Moreover, from Lemma 16.71 in the appendix, for all x > 0, 

p ^ 5.31DU\emxY/^ + S^vlDmX + Svjx + e^.(19.1x)^ ^ ^ ^^.^ 

p 1^^^ ^ 9Dg/^e^x^) + 7.61 7^4^;:^ + e^(40.3x)2 ^ ^ ^ ^^.^ 
We deduce that, for all x > 0, with probability larger than 1 — 4.8e~^, 

\ 3 n — 1 / 

^ 9£'^/'(e^x^)V^ + 7.6Vt;2,Z)^x 
n — 1 

Moreover, for all x > 0, on an event of probability larger than 1 — 3e~^, 

w I (^x (19.1x)2\ 

\ 3 n — 1 J 



n — 1 

5.6 Proof of Theorem [211 

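Recall that $P(\Omega_T^c) \le Ce^{-\frac{1}{2}(\ln n)^{\gamma}}$, and that, on $\Omega_T$,

$$\forall m \in \mathcal{M}_n, \quad (1 - 20\epsilon_n)\frac{R_m}{n} \le \|s - \hat{s}_m\|^2,$$

$$\forall m, m' \in \mathcal{M}_n, \quad \delta(m, m') \le 6\epsilon_n \frac{R_m \vee R_{m'}}{n}.$$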
Let $\Omega_p$ be the event defined in Lemma 5.4 and let $\Omega = \Omega_p \cap \Omega_T$; from Lemma 5.2,
$P(\Omega^c) \le Ce^{-\frac{1}{2}(\ln n)^{\gamma}}$. Recall that $\mathrm{pen}(m) = 2D_m^W/n$. On $\Omega$, from (6), for all $n$ such that
$20\epsilon_n < 1$, for all $m$ in $\mathcal{M}_n$



Is — slP < lis — Smll^ + 26e„^^^ + 16e ^ 



m 

n n 



<: II _ " l|2 I II _ - ||2 _i_ 16^n II _ ~||2 

- " " l-20e„" " l-20e„" " 

Hence, for all n such that 20e„ < 1, on 0, 

(1 - 36e„)||s - s||^ < (1 + 6e„) inf ||s - Sm|P- 
For aU n such that 42/(1 - 36e„) < 100, 

p-gf < ( 1+ , 1 P-Smf < (l + 100e„) inf ||s - 



Hence (25) holds for sufficiently large $n$; it holds in general provided that we enlarge the
constant $C$ if necessary.



6 Appendix 

In this section, we state and prove some technical lemmas that are useful in the proofs.
The main tool is the first lemma, based on Bousquet's version of Talagrand's inequality.
It is a concentration inequality for the square of the supremum of the empirical process
over a uniformly bounded class of functions. Recall first Bousquet's [10] and Klein & Rio's
[17] versions of Talagrand's inequality.

Theorem 6.1 (Bousquet's bound) Let $X_1, \dots, X_n$ be i.i.d. random variables valued in a
measurable space $(\mathbb{X}, \mathcal{X})$ and let $S$ be a class of real valued functions bounded by $b$. Let
$v^2 = \sup_{t \in S} \mathrm{Var}(t(X))$ and let $Z = \sup_{t \in S} \nu_n t$. Then

$$\forall x > 0, \quad P\left( Z > \mathbb{E}(Z) + \sqrt{\frac{2}{n}(v^2 + 2b\mathbb{E}(Z))x} + \frac{bx}{3n} \right) \le e^{-x}.$$

Theorem 6.2 (Klein & Rio's bound) Let $X_1, \dots, X_n$ be i.i.d. random variables valued in
a measurable space $(\mathbb{X}, \mathcal{X})$ and let $S$ be a class of real valued functions bounded by $b$. Let
$v^2 = \sup_{t \in S} \mathrm{Var}(t(X))$ and let $Z = \sup_{t \in S} \nu_n t$. Then

$$\forall x > 0, \quad P\left( Z < \mathbb{E}(Z) - \sqrt{\frac{2}{n}(v^2 + 2b\mathbb{E}(Z))x} - \frac{8bx}{3n} \right) \le e^{-x}.$$

Let us now also recall Bernstein's inequality.
Proposition 6.3 Bernstein's inequality

Let $X_1, \dots, X_n$ be i.i.d. random variables valued in a measurable space $(\mathbb{X}, \mathcal{X})$ and let $t$ be a
measurable real valued function. Then, for all $x > 0$,

$$P\left( \nu_n(t) > \sqrt{\frac{2\mathrm{Var}(t(X))x}{n}} + \frac{\|t\|_\infty x}{3n} \right) \le e^{-x}.$$
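As a sanity check of the form of Bernstein's inequality used in the proofs, $P\big(\nu_n(t) > \sqrt{2\mathrm{Var}(t(X))x/n} + \|t\|_\infty x/(3n)\big) \le e^{-x}$, the sketch below computes the left-hand side exactly in a toy case: Bernoulli($p$) observations with $t$ the identity, so that $\nu_n(t)$ is the centered sample mean and $\|t\|_\infty = 1$. The Bernoulli setting and the function names are assumptions made for the illustration.

```python
import math

def bernstein_threshold(var, sup_norm, n, x):
    """Deviation level sqrt(2 * var * x / n) + sup_norm * x / (3 * n)."""
    return math.sqrt(2.0 * var * x / n) + sup_norm * x / (3.0 * n)

def bernoulli_upper_tail(n, p, threshold):
    """Exact P(mean(X_1..X_n) - p > threshold) for i.i.d. Bernoulli(p) variables."""
    return sum(
        math.comb(n, s) * p ** s * (1.0 - p) ** (n - s)
        for s in range(n + 1)
        if s / n - p > threshold
    )

n, p = 50, 0.3
# Pairs (exact tail probability, Bernstein bound e^{-x}) for a few values of x.
checks = [
    (bernoulli_upper_tail(n, p, bernstein_threshold(p * (1 - p), 1.0, n, x)), math.exp(-x))
    for x in (0.5, 1.0, 2.0, 4.0)
]
```

In each pair the exact tail probability stays below $e^{-x}$, as the inequality guarantees; the gap gives a feel for how conservative the bound is at small sample sizes.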
We derive from these bounds the following useful corollary. Hereafter, $S$ denotes a symmetric
class of real valued functions upper bounded by $b$, $v^2 = \sup_{t \in S} \mathrm{Var}(t(X))$, $Z = \sup_{t \in S} \nu_n t$,
$n\mathbb{E}(Z^2) = D$. Since $S$ is symmetric, we always have $Z \ge 0$.

Corollary 6.4 Let $S$ be a symmetric class of real valued functions upper bounded by $b$,
$v^2 = \sup_{t \in S} \mathrm{Var}(t(X))$, $Z = \sup_{t \in S} \nu_n t$, $n\mathbb{E}(Z^2) = D$, $e_b = b^2/n$ and

nEm = 22566 + (2.1 + \/2^) Vv^? + ^/ISD^/^y^, 



then 

72-1 \ ^ I'TC' I '7^^2 



E(Z^l2>E(z)) < (E(Z))^P (Z > E(Z)) + E^. (38) 
In particular, 

(E(Z))2 < K{Z^) < {E{Z)f + Em. (39) 

Proof : 

We have 

f'OO POO 

E{ZHz>E(z)) = / nZ^lz>EiZ) > x)dx = / F(Zl^>E(z) > V^)dx 
Jo Jo 

/•oo 

= (E(Z))¥ {Z > K{Z)) + / F{Z> ^)dx 

Take x = (E(Z) + ^2{v'^ + 2b'E{Z))y /n + 6y/(3n))^ in the previous integral, from Bous- 
quet's version of Talagrand's inequality, 

- V n Jo y/y n Jo 

Classical computations lead to 

poo g~y /"CXD rOO POO 

/ = 2 / e'^y/ydy = ^/tt, / e"^(iy = / ye'^dy = 1. 

Jo \jy Jo Jo Jo 

Therefore, if e^ = fo^/n, using repeatedly the inequalities 

a"fe^-" < aa + (1 - a)b (40) 
and \/a + 6 < y/a + Vb, we obtain, for all i] > 0, 



(V^E(Z))V2ef < U^\V^HZ)f/' 



+ 



2eb 



Thus 

7mZ){ekf^ 



EiZ^lz>Eiz)) < (^2v^ + ^eb + v^^^'^^ + V^- 



n 

(V?lE(Z))3/2 (^^^1/4 
n 



/ yf2^\ [2^ , , 2 y/2^ 2J^ 14 \ e?, 



3 9 / W n 
Therefore, taking $\eta = 0.088$, we obtain

- ^ ' 71 n n n 

Finally, we use Cauchy-Schwarz inequality to obtain that ■^/nE(Z) < (riE(Z^))"'^/^ 
Since v"^ < D, we get 



We deduce from this result the following concentration inequalities for 
Corollary 6.5 Let Cb = b'^/n. We have, for all x > 0, 

\ n n I ~ 

Moreover, for all x > 0, with probability larger than 1 — , 

D 2 ^ D3/4gi/4^^^4_^27V^) + \/^(4.61 + 3V^) + 225efe(6.2x2 + l) 
n n 
Proof : 

From Bousquet's version of Talagrand's inequality and from $(\mathbb{E}(Z))^2 \le \mathbb{E}(Z^2)$, we
obtain that, for all $x > 0$, with probability larger than $1 - e^{-x}$, $Z^2 - D/n$ is not larger
than



4Z)3/4(e^^2)i/4 ^ /D(i4y^/3 + 2^2^) + AD^/^{ebx'^flyZ + ^v'^x + efexVs 

n 

We use repeatedly the inequality a'^b^"^ < aa + (1 — a)b to obtain that, with probability 
at least 1 — e~^, — D/n is not larger than 



(4 + 32r//9)L»3/4(ej,x2)V4 + 2^f2\fThP^ + Zv'^x + (3 + M/r/^ + S/^)ebx'^/9 



n 

For r] = 0.07, this gives 



2 i:>3/^(eb(192;)2)V4 + 2V2VDv'^x + Stj^x + eb(19x)2 

Zi — — > . 

n n 

For the second one we use Klein's version of Talagrand's inequality to obtain, for all x > 
such that r{x) = ^2{v^ + 26E(Z))x/n + 85x/3n < E(Z), 

< (E(Z) - r(x))2) < e"^. 

We have (E(Z) - r(x)Y = (E(Z))2 - 2E(Z)r(x) + r(x)2 > (E(Z))2 - 2E(Z)r(x), thus 

P (Z^ < {E{Z)f - 2E(Z)r(x)) < e""^. 
From the previous corollary, (E{Z)f > K{Z^) - E„^, thus 

P (Z^ < E(Z2) - - 2E(Z)r(x)) < e"^. 
In order to conclude the proof of 16.5^ just remark that 



f ^ ^ 4Z)3/4(e,x2)V4 + ^^/d^ + 16^^6^x2/3 
2E(Z)r(x) < 



n 

^ ('4 + 3277/9)D3/4fe,: 



(4 + 32??/9)L>3/4(ef,x2)i/4 + 3^!)^ + 16/(9r?2)ebx2 



n 

For = 0,0357, we obtain (jH]). 
Finally, we have obtained the following result for the concentration of $Z^2$ around its mean.



Corollary 6.6 For all x > 0, 



\ n n I ~ 

/ 2 £_ 8£>^/^(efcX^)^/^ + 7.6Wv^Dx + ej,(40.25x)^ \ 
\ n n I 

Proof : 

In order to obtain the second inequality, we remark that the inequality is trivial when 
X < 1, thus we only have to use (j^Tj) for x > 1 and then y/x > 1 and > 1. 

We will use this lemma to obtain a concentration inequality for totally degenerate U- 
statistics of order 2. The following result generalizes a previous inequality due to Houdre 
& Reynaud-Bouret [16] to random variables taking values in a measurable space. 

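Lemma 6.7 Let $X, X_1, \dots, X_n$ be i.i.d. random variables taking values in a measurable space
$(\mathbb{X}, \mathcal{X})$ with common law $P$. Let $\mu$ be a measure on $(\mathbb{X}, \mathcal{X})$ and let $(t_\lambda)_{\lambda \in \Lambda}$ be a set of
functions in $L^2(\mu)$. Let

$$B = \Big\{ t = \sum_{\lambda \in \Lambda} a_\lambda t_\lambda,\ \sum_{\lambda \in \Lambda} a_\lambda^2 \le 1 \Big\}, \qquad D = \mathbb{E}\Big( \sup_{t \in B} (t(X) - Pt)^2 \Big),$$

$$v^2 = \sup_{t \in B} \mathrm{Var}(t(X)), \qquad b = \sup_{t \in B} \|t\|_\infty \quad \text{and} \quad e_b = \frac{b^2}{n}.$$

Let

$$U = \frac{1}{n(n-1)} \sum_{i \ne j = 1}^{n} \sum_{\lambda \in \Lambda} (t_\lambda(X_i) - Pt_\lambda)(t_\lambda(X_j) - Pt_\lambda).$$

Then the following inequalities hold:

$$\forall x > 0, \quad P\left( U > \frac{5.31 D^{3/4}(e_b x^2)^{1/4} + 3\sqrt{v^2 D x} + 3v^2 x + e_b(19.1x)^2}{n-1} \right) \le 2e^{-x}, \qquad (42)$$

$$\forall x > 0, \quad P\left( U < -\frac{9 D^{3/4}(e_b x^2)^{1/4} + 7.61\sqrt{v^2 D x} + e_b(40.3x)^2}{n-1} \right) \le 3.8e^{-x}. \qquad (43)$$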

Proof : 

Remark that, from Cauchy-Schwarz inequality. 



sup '^axl^nitx)] = y^(^'nfa))' 
>i<lAGA / AgA 



sup(i/„(t)) = 
teB \j2 

For all X in X, from Cauchy-Schwarz inequality, 

sup(t(x) - Pt)^ = y^itxix) - Ptxf, 
in particular, $D = \sum_{\lambda \in \Lambda} \mathrm{Var}(t_\lambda(X))$. Moreover, easy algebra leads to

n 

AeA i=l AgA 

1 " 

+ E E(*A(X,)-PtA)(tA(X,)-PtA) 



Let Z2 = supigB(i/„(t))2, Ta = Ea6a(*a - Ptxf 



E{Z^) = E ( -PnTj,) = -. 

^ n In 



Hence 

U = ^(z'-E{Z')--Ur,{TA 
n — 1 \ n 

From Corollary 6.6, for all $x > 0$,

n n 



^2 D''/^{eb{19x)'^y/^ + SVv^Dx + 3t;^x + eb(19x) ^ ^_ 



I 77, 77 j ~ 

Moreover, from Bernstein inequality, for all x > 0, 

,rA>v^2^+^)<e-^ 
We apply inequality (jiO]) with a = D'^/^(ebX^)^/^, 6 = eh^/x, a = 2/3 and we obtain 



X 



Therefore, for all x > 0, 



^ ^ 5.31Z)3/4(e^2;2)i/4 + ^.^Q^D^ + 3^23; + ((19x)2 + (x + \/2^)/3) j ^ 2 



+ Cb ((40.25x)^ + (x + \/2x)/3) \ o q -x 

These inequalities are trivial when $x < 1$. We only use them when $x > 1$ and we obtain
(42) and (43) since $x \le x^2$ and $\sqrt{x} \le x^2$ when $x > 1$.
Let us now state the corollary of Bernstein's inequality that we used repeatedly in the
article.

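Lemma 6.8 Let $X, X_1, \dots, X_n$ be i.i.d. random variables taking values in a measurable space
$(\mathbb{X}, \mathcal{X})$ with common law $P$. Let $\mu$ be a measure on $(\mathbb{X}, \mathcal{X})$ and let $(\psi_\lambda)_{\lambda \in \Lambda}$ be an
orthonormal system in $L^2(\mu)$. Let $L$ be a linear functional in $L^2(\mu)$ and let $B = \{t = \sum_{\lambda \in \Lambda} a_\lambda L(\psi_\lambda),\ \sum_{\lambda \in \Lambda} a_\lambda^2 \le 1\}$, $v^2 = \sup_{t \in B} \mathrm{Var}(t(X))$, $b = \sup_{t \in B} \|t\|_\infty$ and $e_b = b^2/n$.

Let $u$ be a function in $S$, the linear space spanned by the functions $(\psi_\lambda)_{\lambda \in \Lambda}$, and let $\eta > 0$.
Then the following inequality holds:

$$\forall x > 0, \quad P\left( \nu_n(L(u)) > \frac{\eta}{2}\|u\|^2 + \frac{2v^2 x + e_b x^2/9}{\eta n} \right) \le e^{-x}. \qquad (44)$$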

Proof : 

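From Bernstein's inequality,

$$P\left( \nu_n(L(u)) > \sqrt{\frac{2\mathrm{Var}(L(u)(X))x}{n}} + \frac{\|L(u)\|_\infty x}{3n} \right) \le e^{-x}.$$

Since $t = L(u/\|u\|)$ belongs to $B$,

$$\sqrt{\frac{2\mathrm{Var}(L(u)(X))x}{n}} + \frac{\|L(u)\|_\infty x}{3n} = \|u\| \left( \sqrt{\frac{2\mathrm{Var}(t(X))x}{n}} + \frac{\|t\|_\infty x}{3n} \right) \le \|u\| \left( \sqrt{\frac{2v^2 x}{n}} + \frac{bx}{3n} \right).$$

We conclude the proof using the inequality $(a + b)^2 \le 2a^2 + 2b^2$.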